APPARATUS AND METHOD FOR PRODUCING A LIST OF
LIKELY CANDIDATE WORDS CORRESPONDING TO A SPOKEN INPUT 

ABSTRACT 

A speech recognition apparatus and method of selecting 
likely words from a vocabulary of words, wherein each 
word is represented by a sequence of at least one
probabilistic finite state phone machine and wherein an
acoustic processor generates acoustic labels in response
to a spoken input, include: (a) forming a first table
in which each label in the alphabet provides a vote for
each word in the vocabulary, each label vote for a
subject word indicating the likelihood of the subject word
producing the label providing the vote; (b) forming a
second table in which each label is assigned a penalty
for each word in the vocabulary, the penalty assigned
to a given label for a given word being indicative of
the likelihood of the given label not being produced
according to the model for the given word; and (c) for
a given string of labels, determining the likelihood of
a particular word which includes the step of combining
the votes of all labels in the string for the particular
word together with the penalties of all labels not in 
the string for the particular word. 
APPARATUS AND METHOD FOR PRODUCING A
LIST OF LIKELY CANDIDATE WORDS CORRESPONDING TO A
SPOKEN INPUT

FIELD OF THE INVENTION 
The present invention relates generally to the art 
of speech recognition and specifically to the art of 
forming a short list of likely words selected from a 
vocabulary of words. 

DESCRIPTION OF PRIOR AND CONTEMPORANEOUS ART 
The following cases relate to inventions which 
provide background or environment for the present 
invention: Canadian Patent Application No. 482,183, 
entitled "Nonlinear Signal Processing In A Speech 
Recognition System", filed May 23, 1985; Canadian 
Patent Application No. 494,697, entitled "Apparatus And
Method For Performing Acoustic Matching", filed 
November 6, 1985; and Canadian Patent Application 
No. 496,161, entitled "Feneme Based Markov Models For 
Words", filed November 26, 1985; all of which 
applications have been assigned to International 
Business Machines Corporation. 

In a probabilistic approach to speech recognition, 
an acoustic waveform is initially transformed into a 
string of labels, or fenemes, by an acoustic processor. 
The labels, each of which identifies a sound type, are 
selected from an alphabet of typically approximately 
200 different labels. The generating of such labels 
has been discussed in various articles such as "Continuous Speech
Recognition by Statistical Methods", Proceedings of the
IEEE, volume 64, pp. 532-556 (1976) and in the
co-pending patent application entitled "Nonlinear Signal
Processing in a Speech Recognition System".

In employing the labels to achieve speech recognition, 
Markov model phone machines (also referred to as
probabilistic finite state machines) have been discussed. A
Markov model normally includes a plurality of states and 
transitions between the states. In addition, the Markov 
model normally has probabilities assigned thereto re- 
lating to (a) the probability of each transition occur- 
ring and (b) the respective probability of producing 
each label at various transitions. The Markov model, or 
Markov source, has been described in various articles 
such as "A Maximum Likelihood Approach to Continuous 
Speech Recognition", IEEE Transactions on Pattern Anal- 
ysis and Machine Intelligence, volume PAMI-5, Number 2, 
March 1983, by L.R. Bahl, F. Jelinek, and R.L. Mercer. 

In recognizing speech, a matching process is performed 
to determine which word or words in the vocabulary have
the highest likelihood given the string of labels
generated by the acoustic processor. One such matching
procedure is set forth in the co-pending application
entitled "Apparatus and Method for Performing Acoustic
Matching". As set forth therein, acoustic matching is
performed by (a) characterizing each word in a
vocabulary by a sequence of Markov model phone machines and
(b) determining the respective likelihood of each
word-representing sequence of phone machines producing the
string of labels generated by the acoustic processor.
Each word-representing sequence of phone machines is
referred to as a word baseform.

As discussed in the application entitled "Apparatus and 
Method for Performing Acoustic Matching", the word 
baseform may be constructed of phonetic phone machines. 
In this instance, each phone machine preferably corre- 
sponds to a phonetic sound and includes seven states and 
thirteen transitions. 

Alternatively, each word baseform may be formed as a 
sequence of fenemic phone machines. A fenemic phone
machine is a simpler Markov model than the phonetic phone
machine and is preferably constructed with two states. 
Between the first state and the second state is a null 
transition in which no label can be produced. Also be- 
tween the first state and the second state is a non-null 
transition in which a label from the alphabet of labels 
may be produced. At the first state is a self-loop at 
which labels may be produced. During a training phase, 
statistics for each fenemic phone machine are defined. 
That is, for each fenemic phone, a probability for each 
transition and a probability for each label being 
produced at each non-null transition are computed from 
known utterances. Word baseforms are formed by concat- 
enating fenemic phone machines. 
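The two-state fenemic phone machine described above can be sketched as a small data structure together with the probability that a single machine produces a given run of labels. This is a minimal illustration, not the implementation of the referenced applications: the class name `FenemicPhone`, its field names, and the function `phone_likelihood` are hypothetical, and the probabilities in any use of it would come from training.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class FenemicPhone:
    """Two-state fenemic phone machine (hypothetical container for illustration)."""
    p_self: float                  # self-loop at the first state (produces a label)
    p_forward: float               # non-null transition, state 1 -> state 2 (produces a label)
    p_null: float                  # null transition, state 1 -> state 2 (produces no label)
    out_self: Dict[str, float]     # label output probabilities on the self-loop
    out_forward: Dict[str, float]  # label output probabilities on the non-null transition


def phone_likelihood(phone: FenemicPhone, labels: Sequence[str]) -> float:
    """Probability that the phone produces exactly `labels` and ends in state 2."""
    # alpha: probability of still being in state 1 after producing the labels seen so far
    alpha = 1.0
    total = 0.0
    for i, label in enumerate(labels):
        if i == len(labels) - 1:
            # The last label may be produced on the non-null transition into state 2.
            total += alpha * phone.p_forward * phone.out_forward.get(label, 0.0)
        # Otherwise (or alternatively) the label is produced on the self-loop.
        alpha *= phone.p_self * phone.out_self.get(label, 0.0)
    # All labels produced on the self-loop (or no labels at all), then the null transition.
    total += alpha * phone.p_null
    return total
```

With an empty label run the function returns the null-transition probability alone, which reflects that the machine may be traversed without producing any label.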

In systems employing fenemic baseforms and phonetic 
baseforms, the end goal is to find the most likely word 
or words corresponding to a string of labels generated
by an acoustic processor in response to an unknown
speech input. One approach to achieving this goal is 
discussed in the above-cited application "Apparatus and 
Method for Performing Acoustic Matching". Specifically, 
the number of words is first reduced from the total 
number of words in the vocabulary to a list of likely 
candidate words which may be examined in a detailed 
match procedure and/or a language model procedure at 
which preferably the most likely word is selected. In
reducing the number of candidate words, the methodology 
taught in the co-pending application involves applying 
approximations to the phone machines which result in 
faster processing without inordinate computational re- 
quirements. In achieving the reduction in words, the 
approximate acoustic match performs computations through 
a match lattice. The approximate acoustic match has 
proved to be effective and efficient in determining 
which words should be subjected to detailed matching 
and, as appropriate, processing in accordance with the 
language model. 

SUMMARY OF THE INVENTION 

The present invention teaches a different method of re- 
ducing the number of words that are to be examined in a 
detailed match. That is, the present invention relates 
to a polling method wherein a table is set up in which 
each label in the alphabet provides a "vote" for each 
word in the vocabulary. The vote reflects the likelihood
that a given word has produced a given label. The votes
are computed from the label output probability and
transition probability statistics derived during a 
training session. 

In accordance with one embodiment of the invention, when 
a string of labels is generated by an acoustic 
processor, a subject word is selected. From the vote 
table, each label in the string is recognized and the
vote of each label corresponding to the subject word is
determined. All votes of the labels for the subject word 
are accumulated and combined to provide a likelihood 
score. Repeating the process for each word in the vo- 
cabulary results in likelihood scores for each word. A 
list of likely candidate words may be derived from the 
likelihood scores. 

In a second embodiment, a second table is also formed 
which includes a penalty that each label has for each
word in the vocabulary. A penalty assigned to a given 
label indicates the likelihood of a word not producing 
the given label. In the second embodiment, both label 
votes and penalties are considered in determining the 
likelihood score for a given word based on a string of 
labels. 

In order to account for length, the likelihood scores 
are preferably scaled based on the number of labels 
considered in evaluating a likelihood score for a word.
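The polling computation of the two embodiments, with length scaling, can be sketched as follows. The sketch rests on assumptions not fixed by the text above: votes and penalties are taken to be log-domain values so that "combining" is a sum, and scaling is taken to be division by the number of labels in the string. The names `polling_score` and `candidate_list` and the nested-dictionary table layout are hypothetical.

```python
from typing import Dict, List, Sequence

Table = Dict[str, Dict[str, float]]  # table[word][label] -> value (assumed log-domain)


def polling_score(word: str, labels: Sequence[str], votes: Table,
                  penalties: Table, alphabet: Sequence[str]) -> float:
    """Length-scaled score: votes for labels in the string plus penalties for absent labels."""
    if not labels:
        raise ValueError("label string must not be empty")
    present = set(labels)
    # Every label occurrence in the string casts its vote for the word.
    score = sum(votes[word][label] for label in labels)
    # Every alphabet label that never occurs in the string contributes its penalty.
    score += sum(penalties[word][label] for label in alphabet if label not in present)
    # Scale by the number of labels considered (assumed form of the length scaling).
    return score / len(labels)


def candidate_list(labels: Sequence[str], votes: Table, penalties: Table,
                   alphabet: Sequence[str], top_n: int) -> List[str]:
    """Score every word in the vocabulary and keep the top_n as likely candidates."""
    scored = {w: polling_score(w, labels, votes, penalties, alphabet) for w in votes}
    return sorted(scored, key=scored.get, reverse=True)[:top_n]
```

Setting every penalty to zero reduces the sketch to the first embodiment, in which only the votes of labels in the string are accumulated.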

Moreover, when the end time for a word is not defined
along the string of generated labels, the present
invention provides that likelihood scores be computed at
successive time intervals so that a subject word may
have a plurality of successive likelihood scores
associated therewith. The invention further provides that
the best likelihood score for the subject word --when
compared relative to the likelihood scores of preferably
all the other words in the vocabulary-- is assigned to
the subject word.

In accordance with the invention, a method is taught for
selecting likely words from a vocabulary of words 
wherein each word is represented by a sequence of at 
least one probabilistic finite state phone machine and 
wherein an acoustic processor generates acoustic labels 
in response to a spoken input. The method comprises the 
steps of: (a) forming a first table in which each label 
in the alphabet provides a vote for each word in the 
vocabulary, each label vote for a subject word indicat- 
ing the likelihood of the subject word producing the 
label providing the vote. Further, the method prefera- 
bly comprises the steps of: (b) forming a second table 
in which each label is assigned a penalty for each word 
in the vocabulary, the penalty assigned to a given label 
for a given word being indicative of the likelihood of 
the given label not being produced according to the 
model for the given word; and (c) for a given string of 
labels, determining the likelihood of a particular word 
which includes the step of combining the votes of all 
labels in the string for the particular word together 
with the penalties of all labels not in the string for 
the particular word. 
Still further the method preferably comprises the
additional step of: (d) repeating steps (a), (b), and (c)
for all words as the particular word in order to provide
a likelihood score for each word.

If desired, the methodology discussed above is employed 
in conjunction with the approximate acoustic matching 
techniques set forth in the co-pending application
"Apparatus and Method for Performing Acoustic Matching" and
discussed hereinbelow. 

The present invention provides a fast, computationally 
simple, effective technique for determining which words 
in a vocabulary have a relatively high likelihood of 
corresponding to a string of acoustic labels generated 
by an acoustic processor. 
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a general block diagram of a system environment
in which the present invention may be practiced. 

FIG. 2 is a block diagram of the system environment of 
FIG. 1 wherein the stack decoder is shown in greater
detail. 

FIG. 3 is an illustration of a detailed match phone
machine which is identified in storage and represented
therein by statistics obtained during a training
session.

FIG. 4 is an illustration depicting the elements of an 
acoustic processor. 

FIG. 5 is an illustration of a typical human ear indi- 
cating where components of an acoustic model are de- 
fined. 

FIG. 6 is a block diagram showing portions of the acous- 
tic processor. 

FIG. 7 is a graph showing sound intensity versus
frequency, the graph being used in the design of the
acoustic processor.

FIG. 8 is a graph showing the relationship between sones
and phons.

FIG. 9 is a flowchart representation showing how sound
is characterized according to the acoustic processor of
FIG. 4. 

FIG. 10 is a flowchart representation showing how 
thresholds are up-dated in FIG. 9. 

FIG. 11 is a trellis diagram, or lattice, of a detailed
match procedure. 

FIG. 12 is a diagram depicting a phone machine used in
performing matching. 

FIG. 13 is a time distribution diagram used in a matching 
procedure having certain imposed conditions. 

FIG. 14 (a) through (e) are diagrams which show the 
interrelationship between phones, a label string, and 
start and end times determined in the matching proce- 
dure. 

FIG. 15 (a) is a diagram showing a particular phone ma- 
chine of minimum length zero and FIG. 15 (b) is a time 
diagram corresponding thereto. 

FIG. 16 (a) is a phone machine corresponding to a minimum 
length four and FIG. 16 (b) is a time diagram corre- 
sponding thereto. 
FIG. 17 is a diagram illustrating a tree structure of
phones which permit processing of multiple words
simultaneously.

FIG. 18 is a flowchart outlining the steps performed in 
training phone machines for performing acoustic match- 
ing. 

FIG. 19 is an illustration showing successive steps of 
stack decoding. 

FIG. 20 is a graph depicting likelihood vectors for re- 
spective word paths and a likelihood envelope. 

FIG. 21 is a flowchart representing steps in a stack
decoding procedure.

FIG. 22 is a flowchart showing how a word path is extended 
with words obtained from acoustic matching. 

FIG. 23 is an illustration showing a fenemic phone
machine.

FIG. 24 is a trellis diagram for a sequential plurality 
of fenemic phone machines. 

FIG. 25 is a flowchart illustrating the polling
methodology of the present invention.

FIG. 26 is a graph illustrating the count distribution 
for labels. 
FIG. 27 is a graph illustrating the number of times each
phone produced each label during training. 

FIG. 28 is an illustration showing the sequence of phones 
constituting each of two words. 

FIG. 29 is a graph illustrating the expected number of 
counts for a word for each label. 

FIG. 30 is a table showing the vote of each label for each 
word. 

FIG. 31 is a table showing the penalty of each label for 
each word. 

FIG. 32 is a block diagram illustrating apparatus of the 
invention. 
DESCRIPTION OF A PREFERRED EMBODIMENT OF THE INVENTION
(I) Speech Recognition System Environment
A. General Description 

In FIG. 1, a general block diagram of a speech
recognition system 1000 is illustrated. The system 1000
includes a stack decoder 1002 to which are connected an
acoustic processor (AP) 1004, an array processor 1006
used in performing a fast approximate acoustic match,
an array processor 1008 used in performing a detailed
acoustic match, a language model 1010, and a work
station 1012.

The acoustic processor 1004 is designed to transform a 
speech waveform input into a string of labels, or 
fenemes, each of which in a general sense identifies a 
corresponding sound type. In the present system, the 
acoustic processor 1004 is based on a unique model of
the human ear and is described in the above-mentioned
application entitled "Nonlinear Signal Processing in a 
Speech Recognition System". 

The labels, or fenemes, from the acoustic processor 1004 
enter the stack decoder 1002. In a logical sense, the
stack decoder 1002 may be represented by the elements
shown in FIG. 2. That is, the stack decoder 1002 includes
a search element 1020 which communicates with the work
station 1012 and which communicates with the acoustic
processor process, the fast match processor process, the
detailed match process, and the language model process
through respective interfaces 1022, 1024, 1026, and
1028.

In operation, fenemes from the acoustic processor 1004
are directed by the search element 1020 to the fast match 
processor 1006. The fast match procedure is described 
hereinbelow as well as in the application entitled "Ap- 
paratus and Method for Performing Acoustic Matching". 
Briefly, the object of matching is to determine the most 
likely word (or words) for a given string of labels. 

The fast match is designed to examine words in a vocab- 
ulary of words and to reduce the number of candidate 
words for a given string of incoming labels. The fast 
match is based on probabilistic finite state machines, 
also referred to herein as Markov models. 

Once the fast match reduces the number of candidate 
words, the stack decoder 1002 communicates with the 
language model 1010 which determines the contextual
likelihood of each candidate word in the fast match 
candidate list based preferably on existing tri-grams. 

Preferably, the detailed match examines those words from 
the fast match candidate list which have a reasonable
likelihood of being the spoken word based on the lan- 
guage model computations. The detailed match is dis- 
cussed in the above-mentioned application entitled 
"Apparatus and Method for Performing Acoustic Matching". 
The detailed match is performed by means of Markov model
phone machines such as the machine illustrated in FIG. 3. 

After the detailed match, the language model is, pref- 
erably, again invoked to determine word likelihood. The 
stack decoder 1002 of the present invention --using
information derived from the fast matching, detailed
matching, and applying the language model-- is designed
to determine the most likely path, or sequence, of words 
for a string of generated labels. 

Two prior art approaches for finding the most likely 
word sequence are Viterbi decoding and single stack
decoding. Each of these techniques is described in the
article entitled "A Maximum Likelihood Approach to
Continuous Speech Recognition." Viterbi decoding is
described in section V and single stack decoding in
section VI of the article.

In the single stack decoding technique, paths of varying 
length are listed in a single stack according to like- 
lihood and decoding is based on the single stack. Single 
stack decoding must account for the fact that likelihood 
is somewhat dependent on path length and, hence, nor- 
malization is generally employed. 

The Viterbi technique does not require normalization
and is generally practical for small tasks. 

As another alternative, decoding may be performed with 
a small vocabulary system by examining each possible 
combination of words as a possible word sequence and
determining which combination has the highest probabil- 
ity of producing the generated label string. The compu- 
tational requirements for this technique become 
impractical for large vocabulary systems. 

The stack decoder 1002, in effect, serves to control the 
other elements but does not perform many computations. 
Hence, the stack decoder 1002 preferably includes a 4341 
running under the IBM VM/370 operating system as
described in publications such as Virtual Machine/System
Product Introduction Release 3 (1983). The array 
processors which perform considerable computation have 
been implemented with Floating Point System (FPS) 
190L's, which are commercially available. 

A novel technique which includes multiple stacking and 
a unique decision strategy for determining the best word 
sequence, or path, has been invented by L.R. Bahl, 
F. Jelinek, and R.L. Mercer and is discussed
hereinbelow. 

B. The Auditory Model and Implementation Thereof in an
Acoustic Processor of a Speech Recognition System

In FIG. 4 a specific embodiment of an acoustic processor 
1100, as described above, is illustrated. An acoustic 
wave input (e.g., natural speech) enters an analog-to- 
digital converter 1102 which samples at a prescribed 
rate. A typical sampling rate is one sample every 50 
microseconds. To shape the edges of the digital signal,
a time window generator 1104 is provided. The output 
of the window 1104 enters a fast Fourier transform (FFT) 
element 1106 which provides a frequency spectrum output 
for each time window. 

The output of the FFT element 1106 is then processed to
produce labels y1 y2 ... yf. Four elements --a feature
selection element 1108, a cluster element 1110, a
prototype element 1112, and a labeller 1114-- coact to
generate the labels. In generating the labels,
prototypes are defined as points (or vectors) in the space
based on selected features, and acoustic inputs are then
characterized by the same selected features to provide
corresponding points (or vectors) in space that can be
compared to the prototypes.

Specifically, in defining the prototypes, sets of points 
are grouped together as respective clusters by cluster 
element 1110. Methods for defining clusters have been 
based on probability distributions --such as the
Gaussian distribution-- applied to speech. The prototype
of each cluster --relating to the centroid or other
characteristic of the cluster-- is generated by the
prototype element 1112. The generated prototypes and
acoustic input --both characterized by the same selected
features-- enter the labeller 1114. The labeller 1114
performs a comparing procedure which results in
assigning a label to a particular acoustic input.
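
The prototype and labelling steps above can be illustrated by taking each prototype as the centroid of its cluster and assigning to an acoustic input the label of the nearest prototype. The text leaves the comparing procedure open, so the Euclidean distance used here is an assumption, as are the function names `centroid` and `assign_label` and the label names in any use of them.

```python
import math
from typing import Dict, List, Sequence


def centroid(points: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of the feature vectors grouped into one cluster."""
    count = len(points)
    return [sum(p[i] for p in points) / count for i in range(len(points[0]))]


def assign_label(features: Sequence[float],
                 prototypes: Dict[str, Sequence[float]]) -> str:
    """Return the label whose prototype lies closest to the feature vector.

    Euclidean distance is assumed here; the source only calls for a comparison.
    """
    return min(prototypes, key=lambda name: math.dist(features, prototypes[name]))
```

For instance, with one prototype formed as `centroid([[0, 0], [2, 0]])` and another at `[5, 5]`, an input at `[1.2, 0.3]` receives the label of the first prototype.
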
The selection of appropriate features is a key factor
in deriving labels which represent the acoustic (speech)
wave input. The presently described acoustic processor
includes an improved feature selection element 1108.
In accordance with the present acoustic processor, an 
auditory model is derived and applied in an acoustic 
processor of a speech recognition system. In explaining 
the auditory model, reference is made to FIG. 5.

FIG. 5 shows part of the inner human ear. Specifically,
an inner hair cell 1200 is shown with end portions 1202
extending therefrom into a fluid-containing channel
1204. Upstream from inner hair cells are outer hair
cells 1206 also shown with end portions 1208 extending
into the channel 1204. Associated with the inner hair
cell 1200 and outer hair cells 1206 are nerves which
convey information to the brain. Specifically, neurons
undergo electrochemical changes which result in
electrical impulses being conveyed along a nerve to the
brain for processing. Effectuation of the
electrochemical changes is stimulated by the mechanical
motion of the basilar membrane 1210.

It has been recognized, in prior teachings, that the 
basilar membrane 1210 serves as a frequency analyzer for 
acoustic waveform inputs and that portions along the
basilar membrane 1210 respond to respective critical 
frequency bands. That different portions of the basilar 
membrane 1210 respond to corresponding frequency bands 
has an impact on the loudness perceived for an acoustic 
waveform input. That is, the loudness of tones is
perceived to be greater when two tones are in different
critical frequency bands than when two tones of similar
power intensity occupy the same frequency band. It has
been found that there are on the order of twenty-two 
critical frequency bands defined by the basilar membrane 
1210. 

Conforming to the frequency-response of the basilar 
membrane 1210, the present acoustic processor 1100 in 
its preferred form physically defines the acoustic 
waveform input into some or all of the critical fre- 
quency bands and then examines the signal component for 
each defined critical frequency band separately. This
function is achieved by appropriately filtering the 
signal from the FFT element 1106 (see FIG. 4) to provide
a separate signal in the feature selection element 1108 
for each examined critical frequency band. 

The separate inputs, it is noted, have also been blocked 
into time frames (of preferably 25.6 msec) by the time 
window generator 1104. Hence, the feature selection 
element 1108 preferably includes twenty-two signals,
each of which represents sound intensity in a given
frequency band for one frame in time after another.

The filtering is preferably performed by a conventional 
critical band filter 1300 of FIG. 6. The separate 
signals are then processed by an equal loudness con- 
verter 1302 which accounts for perceived loudness vari- 
ations as a function of frequency. In this regard, it 
is noted that a first tone at a given dB level at one 
frequency may differ in perceived loudness from a second
tone at the same given dB level at a second frequency.
The converter 1302 can be based on empirical data,
converting the signals in the various frequency bands so
that each is measured by a similar loudness scale. For
example, the converter 1302 preferably maps from acoustic
power to equal loudness based on studies of Fletcher and
Munson in 1933, subject to certain modifications. The
modified results of these studies are depicted in FIG. 7.
In accordance with FIG. 7, a 1KHz tone at 40dB is
comparable in loudness level to a 100Hz tone at 60dB as shown
by the X in the figure.

The converter 1302 adjusts loudness preferably in
accordance with the contours of FIG. 7 to effect equal
loudness regardless of frequency.

In addition to dependence on frequency, power changes
and loudness changes do not correspond as one looks at
a single frequency in FIG. 7. That is, variations in the
sound intensity, or amplitude, are not at all points
reflected by similar changes in perceived loudness. For
example, at 100 Hz, the perceived change in loudness of
a 10dB change at about 110dB is much larger than the
perceived change in loudness of a 10dB change at 20dB.
This difference is addressed by a loudness scaling
element 1304 which compresses loudness in a predefined
fashion. Preferably, the loudness scaling element
compresses power P by a cube-root factor to P^(1/3) by
replacing loudness amplitude measure in phons by sones.
FIG. 8 illustrates a known representation of phons versus
sones determined empirically. By employing sones, the
present model remains substantially accurate at large
speech signal amplitudes. One sone, it should be
recognized, has been defined as the loudness of a 1KHz tone
at 40dB.
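The relation between phons, sones, and cube-root power compression can be made concrete with a short numerical sketch. It assumes the conventional rule that loudness in sones doubles for every 10-phon increase above 40 phons, which is not stated in the text above but agrees with the definition of one sone given there; the function names are hypothetical.

```python
def phons_to_sones(phons: float) -> float:
    """Loudness in sones for a loudness level in phons (assumed rule, about 40 phons and up)."""
    return 2.0 ** ((phons - 40.0) / 10.0)


def cube_root_compress(power: float) -> float:
    """Cube-root compression of a power value P to P^(1/3)."""
    return power ** (1.0 / 3.0)
```

A 10dB increase multiplies power by ten, and the cube root of ten is about 2.15, so cube-root compression of power roughly tracks the doubling of sones per 10 phons.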

Referring again to FIG. 6, a novel time varying response 
element 1306 is shown which acts on the equal loudness, 
loudness scaled signals associated with each critical
frequency band. Specifically, for each frequency band 
examined, a neural firing rate f is determined at each 
time frame. The firing rate f is defined in accordance 
with the present processor as: 

f = (So + DL)n (1) 

where n is an amount of neurotransmitter; So is a spon- 
taneous firing constant which relates to neural firings 
independent of acoustic waveform input; L is a
measurement of loudness; and D is a displacement constant.
(So)n corresponds to the spontaneous neural firing rate
which occurs whether or not there is an acoustic wave 
input and DLn corresponds to the firing rate due to the 
acoustic wave input. 

Significantly, the value of n is characterized by the 
present acoustic processor as changing over time ac- 
cording to the relationship: 
dn/dt = Ao - (So + Sh + DL)n (2)



where Ao is a replenishment constant and Sh is a
spontaneous neurotransmitter decay constant. The novel
relationship set forth in equation (2) takes into account
that neurotransmitter is being produced at a certain
rate (Ao) and is lost (a) through decay (Sh x n), (b)
through spontaneous firing (So x n), and (c) through
neural firing due to acoustic wave input (DL x n). The
presumed locations of these modelled phenomena are
illustrated in FIG. 5.



Equation (2) also reflects the fact that the present 
acoustic processor is non-linear in that the next amount 
of neurotransmitter and the next firing rate are de- 
pendent multiplicatively on the current conditions of 
at least the neurotransmitter amount. That is, the 
amount of neurotransmitter at time (t+Δt) is equal to
the amount of neurotransmitter at time t plus (dn/dt)Δt,
or:

n(t+Δt) = n(t) + (dn/dt)Δt (3)

Equations (1), (2), and (3) describe a time varying
signal analyzer which, it is suggested, addresses the
fact that the auditory system appears to be adaptive
over time, causing signals on the auditory nerve to be 
non-linearly related to acoustic wave input. In this 
regard, the present acoustic processor provides the 
first model which embodies non-linear signal processing 
in a speech recognition system, so as to better conform 
to apparent time variations in the nervous system. 
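Equations (1), (2), and (3) amount to a forward-difference update that can be stepped once per time frame. The sketch below is illustrative only: the function name `simulate_firing`, the choice of starting from the resting steady state, and the default step of one time unit are assumptions rather than details taken from the description.

```python
from typing import List, Optional, Sequence


def simulate_firing(loudness: Sequence[float], So: float, Sh: float, D: float,
                    Ao: float = 1.0, dt: float = 1.0,
                    n0: Optional[float] = None) -> List[float]:
    """Step equations (1)-(3) over a sequence of loudness values, one per time step."""
    # Start from the resting steady state (L = 0, dn/dt = 0) unless n0 is supplied.
    n = Ao / (So + Sh) if n0 is None else n0
    rates = []
    for L in loudness:
        rates.append((So + D * L) * n)          # equation (1): firing rate f
        dn_dt = Ao - (So + Sh + D * L) * n      # equation (2): neurotransmitter change
        n = n + dn_dt * dt                      # equation (3): next neurotransmitter amount
    return rates
```

Driving the sketch with silence followed by a constant loudness shows the adaptive behavior: the firing rate jumps at onset and then decays toward a lower steady level as the neurotransmitter is depleted.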



In order to reduce the number of unknowns in equations 
(1) and (2), the present acoustic processor uses the 
following equation (4) which applies to fixed loudness
L:

So + Sh + DL = 1/T (4)



T is a measure of the time it takes for an auditory re- 
sponse to drop to 37% of its maximum after an audio wave 
input is generated. T, it is noted, is a function of 
loudness and is, according to the present acoustic 
processor, derived from existing graphs which display 
the decay of the response for various loudness levels. 
That is, when a tone of fixed loudness is generated, it
generates a response at a first high level after which
the response decays toward a steady condition level with
a time constant T. With no acoustic wave input, T = To,
which is on the order of 50 msec. For a loudness of
Lmax, T = Tmax, which is on the order of 30 msec. By
setting Ao = 1, 1/(So + Sh) is determined to be 5 csec when
L = 0. When L is Lmax and Lmax = 20 sones, equation (5)
results:

So + Sh + D(20) = 1/30 (5)

With the above data and equations, So and Sh are defined
by equations (6) and (7) as:

So = D Lmax / (R + (D Lmax To R) - 1) (6)

Sh = 1/To - So (7)

where

R = f(steady state, L = Lmax) / f(steady state, L = 0) (8)

f(steady state) represents the firing rate at a given
loudness when dn/dt is zero.

R, it is noted, is the only variable left in the acoustic
processor. Hence, to alter the performance of the 
processor, only R is changed. R, that is, is a single 
parameter which may be adjusted to alter performance 
which, normally, means minimizing steady state effects 
relative to transient effects. It is desired to
minimize steady state effects because inconsistent output
patterns for similar speech inputs generally result from 
differences in frequency response, speaker differences, 
background noise, and distortion which affect the steady 
state portions of the speech signal but not the tran- 
sient portions. The value of R is preferably set by 
optimizing the error rate of the complete speech
recognition system. A suitable value found in this way is
R = 1.5. Values of So and Sh are then 0.0888 and 0.11111
respectively, with D being derived as 0.00666.
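The quoted constants can be checked numerically. The sketch below takes R as the ratio of the steady-state firing rate at maximum loudness to that at zero loudness, applies So + Sh + DL = 1/T at L = 0 and at L = Lmax, and sets Ao = 1. Time constants are expressed in centiseconds (5 for silence, 3 for maximum loudness); that unit choice is an assumption made here because it reproduces the values quoted above. The function name is hypothetical.

```python
from typing import Tuple


def model_constants(R: float, T0: float, Tmax: float, Lmax: float) -> Tuple[float, float, float]:
    """Return (So, Sh, D) for Ao = 1 from the decay times and the steady-state ratio R."""
    # Subtracting So + Sh = 1/T0 from So + Sh + D*Lmax = 1/Tmax isolates D.
    D = (1.0 / Tmax - 1.0 / T0) / Lmax
    # Ratio of steady-state firing rates at L = Lmax and L = 0, solved for So.
    So = D * Lmax / (R + D * Lmax * T0 * R - 1.0)
    # Remaining decay constant from So + Sh = 1/T0.
    Sh = 1.0 / T0 - So
    return So, Sh, D


# Time constants in centiseconds (assumed unit), loudness in sones.
So, Sh, D = model_constants(R=1.5, T0=5.0, Tmax=3.0, Lmax=20.0)
```

With these inputs the sketch yields approximately 0.0889, 0.1111, and 0.00667, matching the stated values to the precision given.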



Referring to FIG. 9, a flowchart of the present acoustic 
processor is depicted. Digitized speech in a 25.6 msec 
time frame, sampled at preferably 2OKH2 passes through 
a Hanning Window 1320 the output from which is subject 
to a Fourier Transform 1322, taken at preferably 10 msec 
intervals. The transform output is filtered by element 
1324 to provide a power density output for each of at 
least one frequency band — preferably all the critical 
frequency bands or at least twenty thereof. The power 
density is then transformed from log magnitude 1326 to 
loudness level. This is readily performed according to 
the modified graph of FIG. 7. The process outlined here- 
after which includes threshold up-dating of step 1330 
is depicted in FIG. 10.

In FIG. 10, a threshold-of-feeling T_f and a
threshold-of-hearing T_h are initially defined (at step 1340) for
each filtered frequency band m to be 120 dB and 0 dB
respectively. Thereafter, a speech counter, total frames
register, and a histogram register are reset at step
1342.



Each histogram includes bins, each of which indicates
the number of samples or counts during which power or
some similar measure -- in a given frequency band -- is
in a respective range. A histogram in the present
instance preferably represents -- for each given frequency
band -- the number of centiseconds during which loudness
is in each of a plurality of loudness ranges. For
example, in the third frequency band, there may be twenty
centiseconds between 10 dB and 20 dB in power. Similarly,
in the twentieth frequency band, there may be one hundred
fifty out of a total of one thousand centiseconds
between 50 dB and 60 dB. From the total number of samples
(or centiseconds) and the counts contained in the bins,
percentiles are derived.



A frame from the filter output of a respective frequency 
band is examined at step 1344 and bins in the appropriate 
histograms — one per filter — are incremented at step 
1346. The total number of bins in which the amplitude 
exceeds 55dB are summed for each filter (i.e. frequency 
band) at step 1348 and the number of filters indicating 
the presence of speech is determined. If there is not 
a minimum of filters (e.g. six of twenty) to suggest 
speech, the next frame is examined at step 1344. If 
there are enough filters to indicate speech at step 
1350, a speech counter is incremented at step 1352. The 
speech counter is incremented at step 1352 until 10 
seconds of speech have occurred at step 1354 whereupon
new values for T_f and T_h are defined for each filter at
step 1356.

The new T_f and T_h values are determined for a given
filter as follows. For T_f, the dB value of the bin
holding the 35th sample from the top of 1000 bins (i.e.
the 96.5th percentile of speech) is defined as BIN_H.
T_f is then set as: BIN_H + 40 dB. For T_h, the dB value
of the bin holding the (.01)(TOTAL BINS - SPEECH COUNT)th
value from the lowest bin is defined as BIN_L. That
is, BIN_L is the bin in the histogram which is 1% of the
number of samples in the histogram excluding the number
of samples classified as speech. T_h is then defined as:
T_h = BIN_L - 30 dB.
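
The threshold update just described can be sketched in Python
as follows. This is an illustration under stated assumptions,
not the implementation of the present processor: the histogram
is taken to be a list of sample counts per bin, each bin is
assumed to be 1 dB wide with bin i starting at i dB, "TOTAL
BINS" is read as the total number of samples in the histogram,
and the function names are hypothetical.

```python
def bin_holding_kth_sample(histogram, k, from_top):
    """Index of the bin containing the k-th sample (k >= 1), counting
    from the highest bin if from_top is True, else from the lowest."""
    if from_top:
        order = range(len(histogram) - 1, -1, -1)
    else:
        order = range(len(histogram))
    seen = 0
    for index in order:
        seen += histogram[index]
        if seen >= k:
            return index
    raise ValueError("histogram holds fewer than k samples")


def update_thresholds(histogram, speech_count, bin_width_db=1.0):
    """Return (T_f, T_h) in dB for one filter (frequency band)."""
    total = sum(histogram)
    # 35th sample from the top per 1000 speech samples (96.5th percentile).
    k_top = max(1, round(0.035 * speech_count))
    # 1% of the samples that were not classified as speech.
    k_low = max(1, round(0.01 * (total - speech_count)))
    bin_h = bin_holding_kth_sample(histogram, k_top, from_top=True)
    bin_l = bin_holding_kth_sample(histogram, k_low, from_top=False)
    t_f = bin_h * bin_width_db + 40.0   # threshold of feeling
    t_h = bin_l * bin_width_db - 30.0   # threshold of hearing
    return t_f, t_h


# Toy histogram: 1000 speech samples at 60 dB, 500 quiet samples at 20 dB.
histogram = [0] * 120
histogram[60] = 1000
histogram[20] = 500
thresholds = update_thresholds(histogram, speech_count=1000)
print(thresholds)
```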

Returning to FIG. 9, the sound amplitudes are converted 
to sones and scaled based on the updated thresholds 
(steps 1330 and 1332) as described hereinbefore. An 
alternative method of deriving sones and scaling is by 
taking the filter amplitudes "a" (after the bins have 
been incremented) and converting to dB according to the 
expression: 

a^dB = 20 log10(a) - 10 (9)

Each filter amplitude is then scaled to a range between
0 and 120 to provide equal loudness according to the
expression:

a^eql = 120(a^dB - T_h)/(T_f - T_h) (10)

a^eql is then preferably converted from a loudness level
(phons) to an approximation of loudness in sones (with
a 1 kHz signal at 40 dB mapping to 1) by the expression:

Loudness in sones is then approximated as:

L_s(appr) = 10^(L^dB/20) (12)

The loudness in sones is then provided as input to
the equations (1) and (2) at step 1334 to determine the
output firing rate f for each frequency band (step
1335). With twenty-two frequency bands, a twenty-two
dimension vector characterizes the acoustic wave inputs
over successive time frames. Generally, however, twenty
frequency bands are examined by employing a conventional
mel-scaled filter bank.



Prior to processing the next time frame (step 1336), the 
next state of n is determined in accordance with 
equation (3) in step 1337. 



The acoustic processor hereinbefore described is subject 
to improvement in applications where the firing rate f 
and neurotransmitter amount n have large DC pedestals.
That is, where the dynamic range of the terms of the f 
and n equations is important, the following equations 
are derived to reduce the pedestal height. 






In the steady state, and in the absence of an acoustic
wave input signal (L = 0), equation (2) can be solved
for a steady-state internal state n':

n'= A/(So + Sh) (13) 

The internal state of the neurotransmitter amount n(t) 
can be represented as a steady state portion and a 
varying portion: 

n(t) = n'+ n"(t) (14) 

Combining equations (1) and (14), the following ex- 
pression for the firing rate results: 

f(t) = (So + D x L)(n' + n"(t)) (15)

The term So x n' is a constant, while all other terms
include either the varying part of n or the input signal
represented by (D x L). Future processing will involve
only the squared differences between output vectors, so
that constant terms may be disregarded. Including
equation (13) for n', we get

f"(t) = (So + D x L) x n"(t) + (D x L x A)/(So + Sh) (16)

Considering equation (3), the next state becomes: 
n(t + Δt) = n'(t + Δt) + n"(t + Δt) (17)

= n"(t) + A - (So + Sh + D x L) x (n' + n"(t)) (18)

= n"(t) - (Sh x n"(t)) - (So + D x L) x n"(t)
  - (A x D x L)/(So + Sh) + A - ((So x A) + (Sh x A))/(So + Sh)
(19)

This equation (19) may be rewritten, ignoring all
constant terms, as:

n"(t + Δt) = n"(t)(1 - Sh Δt) - f"(t) (20)



Equations (15) and (20) now constitute the output
equations and state-update equations applied to each
filter during each 10 millisecond time frame. The result
of applying these equations is a 20 element vector
each 10 milliseconds, each element of the vector
corresponding to a firing rate for a respective frequency
band in the mel-scaled filter bank.

With respect to the embodiment set forth immediately
hereinabove, the flowchart of FIG. 9 applies except that
the equations for f, dn/dt, and n(t+1) are replaced by
equations (15) and (20) which define special case
expressions for firing rate f and next state n(t+Δt)
respectively.

It is to be noted that the values attributed to the terms
in the various equations (namely T_0 = 5 csec, T_Lmax =
3 csec, Ao = 1, R = 1.5, and L_max = 20) may be set
otherwise and the terms So, Sh, and D may differ from the
preferable derived values of 0.0888, 0.11111, and
0.00666, respectively, as other terms are set
differently.

The present acoustic model has been practiced using the
PL/I programming language with Floating Point Systems
FPS 190L hardware; however, it may be practiced by various
other software or hardware approaches.

C. Detailed Match 

In FIG. 3, a sample detailed match phone machine 2000 is
depicted. Each detailed match phone machine is a
probabilistic finite-state machine characterized by (a) a
plurality of states Si, (b) a plurality of transitions
tr(Sj|Si), some of the transitions extending between
different states and some extending from a state back
to itself, each transition having associated therewith
a corresponding probability, and (c) for each label that
can be generated at a particular transition, a
corresponding actual label probability.

In FIG. 3, seven states S1 through S7 are provided and
thirteen transitions tr1 through tr13 are provided in
the detailed match phone machine 2000. A review of FIG. 3
shows that phone machine 2000 has three transitions with
dashed line paths, namely transitions tr11, tr12, and
tr13. At each of these three transitions, the phone can
change from one state to another without producing a
label and such a transition is, accordingly, referred
to as a null transition. Along transitions tr1 through
tr10 labels can be produced. Specifically, along each
transition tr1 through tr10, one or more labels may have
a distinct probability of being generated thereat.
Preferably, for each transition there is a probability
associated with each label that can be generated in the
system. That is, if there are two hundred labels that
can be selectively generated by the acoustic channel,
each transition (that is not a null) has two hundred
"actual label probabilities" associated therewith, each
of which corresponds to the probability that a
corresponding label is generated by the phone at the
particular transition. The actual label probabilities for
transition tr1 are represented by the symbol p followed
by the bracketed column of numerals 1 through 200, each
numeral representing a given label. For label 1, there
is a probability p[1] that the detailed phone machine 2000
generates the label 1 at transition tr1. The various
actual label probabilities are stored with relation to
the label and a corresponding transition.

When a string of labels y1y2y3... is presented to a
detailed match phone machine 2000 corresponding to a given
phone, a match procedure is performed. The procedure
associated with the detailed match phone machine is
explained with reference to FIG. 11.

FIG. 11 is a trellis diagram of the phone machine of FIG. 3.
As in the phone machine, the trellis diagram shows a null
transition from state S1 to state S7 and transitions
from state S1 to state S2 and from state S1 to state S4.
The transitions between other states are also
illustrated. The trellis diagram also shows time measured in
the horizontal direction. Start-time probabilities q0
and q1 represent the probabilities that a phone has a
start time at time t=t0 or t=t1, respectively, for the
phone. At each start time t0 and t1, the various
transitions are shown. It should be noted, in this regard,
that the interval between successive start (and end)
times is preferably equal in length to the time interval
of a label.

In employing the detailed match phone machine 2000 to 
determine how closely a given phone matches the labels 
of an incoming string, an end-time distribution for the 
phone is sought and used in determining a match value
for the phone. The notion of relying on the end-time 
distribution is common to all embodiments of phone ma- 
chines discussed herein relative to a matching proce- 
dure. In generating the end-time distribution to 
perform a detailed match, the detailed match phone ma- 
chine 2000 involves computations which are exact and 
complicated. 

Looking at the trellis diagram of FIG. 11, we first
consider the computations required to have both a start
time and end time at time t=t0. For this to be the case
according to the example phone machine structure set
forth in FIG. 3, the following probability applies:

Pr(S7,t=t0) = q0 x T(1->7) + Pr(S2,t=t0) x T(2->7) +
Pr(S3,t=t0) x T(3->7) (21)

where Pr represents "probability of" and T represents
the transition probability between the two
parenthetically identified states. The above equation indicates
the respective probabilities for the three conditions
under which the end time can occur at time t=t0.
Moreover, it is observed that the end time at t=t0 is limited
in the current example to occurrence at state S7.

Looking next at the end time t=t1, it is noted that a
calculation relating to every state other than state S1
must be made. The state S1 starts at the end time of the
previous phone. For purposes of explanation, only the
calculations pertaining to state S4 are set forth.

For state S4, the calculation is:

Pr(S4,t=t1) = Pr(S1,t=t0) x T(1->4) x Pr(y1|1->4) +
Pr(S4,t=t0) x T(4->4) x Pr(y1|4->4) (22)

In words, equation (22) set forth immediately above
indicates that the probability of the phone machine
being in state S4 at time t=t1 is dependent on the sum of
the following two terms: (a) the probability of being at
state S1 at time t=t0 multiplied by the probability (T)
of the transition from state S1 to state S4, multiplied
further by the probability (Pr) of a given label y1 being
generated given a transition from state S1 to state S4,
and (b) the probability of being at state S4 at time t=t0
multiplied by the probability of the transition from
state S4 to itself and further multiplied by the
probability of generating the given label y1 during and given
the transition from state S4 to itself.

Similarly, calculations pertaining to the other states
(excluding state S1) are also performed to generate
corresponding probabilities that the phone is at a
particular state at time t=t1. Generally, in determining
the probability of being at a subject state at a given
time, the detailed match (a) recognizes each previous
state that has a transition which leads to the subject
state and the respective probability of each such
previous state; (b) recognizes, for each such previous
state, a value representing the probability of the label
that must be generated at the transition between each
such previous state and the current state in order to
conform to the label string; and (c) combines the
probability of each previous state and the respective value
representing the label probability to provide a subject
state probability over a corresponding transition. The
overall probability of being at the subject state is
determined from the subject state probabilities over all
transitions leading thereto. The calculation for state
S7, it is noted, includes terms relating to the three
null transitions which permit the phone to start and end
at time t=t0 with the phone ending in state S7.
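
The general procedure of steps (a) through (c) can be sketched
in Python as a forward pass over the trellis. This is a minimal
illustration, not the APAL implementation: the three-state phone
machine, its probabilities, and the function name are invented
for the example, states are numbered from 0, the last state is
treated as the ending state, and every null transition is
assumed to lead from a lower-numbered state to a higher-numbered
one so that a single ordered sweep suffices.

```python
def end_time_distribution(num_states, transitions, start_dist, labels):
    """Probability of being in the last state at each time t = 0..len(labels).

    transitions: tuples (src, dst, prob, label_probs); label_probs maps a
    label to its probability, or is None for a null transition.
    start_dist[t]: probability that the phone starts (in state 0) at time t.
    """
    null_arcs = sorted((arc for arc in transitions if arc[3] is None),
                       key=lambda arc: arc[0])
    label_arcs = [arc for arc in transitions if arc[3] is not None]
    alpha = [[0.0] * num_states for _ in range(len(labels) + 1)]
    for t in range(len(labels) + 1):
        if t < len(start_dist):
            alpha[t][0] += start_dist[t]
        if t > 0:
            y = labels[t - 1]  # label produced between time t-1 and time t
            for src, dst, prob, label_probs in label_arcs:
                alpha[t][dst] += alpha[t - 1][src] * prob * label_probs.get(y, 0.0)
        # Null transitions consume no label and stay within the same time.
        for src, dst, prob, _ in null_arcs:
            alpha[t][dst] += alpha[t][src] * prob
    return [column[num_states - 1] for column in alpha]


# Hypothetical toy phone machine: 0 -> 1 -> 2 with a self-loop on 1 and a
# null transition 0 -> 2.
toy_transitions = [
    (0, 1, 0.6, {"a": 0.5, "b": 0.5}),
    (1, 1, 0.3, {"a": 0.9, "b": 0.1}),
    (1, 2, 0.7, {"a": 0.2, "b": 0.8}),
    (0, 2, 0.4, None),
]
end_dist = end_time_distribution(3, toy_transitions, [1.0], ["a", "b"])
print(end_dist)
```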

As with the probability determinations relative to times
t=t0 and t=t1, probability determinations for a series
of other end times are preferably generated to form an
end-time distribution. The value of the end-time
distribution for a given phone provides an indication of
how well the given phone matches the incoming labels.






In determining how well a word matches a string of
incoming labels, the phones which represent the word are
processed in sequence. Each phone generates an end-time
distribution of probability values. A match value for
the phone is obtained by summing up the end-time
probabilities and then taking the logarithm of that sum. A
start-time distribution for the next phone is derived
by normalizing the end-time distribution by, for
example, scaling each value thereof by dividing each value
by the sum so that the sum of scaled values totals one.
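
A minimal Python sketch of this step follows. It assumes a
base-10 logarithm and an end-time distribution held as a plain
list of probabilities; the function name is illustrative only.

```python
import math


def match_value_and_next_start(end_dist):
    """Return (match value, start-time distribution for the next phone)."""
    total = sum(end_dist)
    if total <= 0.0:
        raise ValueError("end-time distribution has no probability mass")
    match_value = math.log10(total)
    next_start = [value / total for value in end_dist]  # sums to one
    return match_value, next_start


value, next_start = match_value_and_next_start([0.4, 0.0, 0.168])
print(value, next_start)
```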

It should be realized that there are at least two methods
of determining h, the number of phones to be examined
for a given word or word string. In a depth first method,
computation is made along a baseform --computing a
running subtotal with each successive phone. When the
subtotal is found to be below a predefined threshold for a
given phone position therealong, the computation
terminates. Alternatively, in a breadth first method, a
computation for similar phone positions in each word is
made. The computations following the first phone in each
word, the second phone in each word, and so on are made.
In the breadth first method, the computations along the
same number of phones for the various words are compared
at the same relative phone positions therealong. In
either method, the word(s) having the largest sum of match
values is the sought object.

The detailed match has been implemented in APAL (Array
Processor Assembly Language) which is the native
assembler for the Floating Point Systems, Inc. 190L.

It should be recognized that the detailed match requires
considerable memory for storing each of the actual label
probabilities (i.e., the probability that a given phone
generates a given label y at a given transition); the
transition probabilities for each phone machine; and the
probabilities of a given phone being at a given state
at a given time after a defined start time. The
above-noted FPS 190L is set up to make the various computations
of end times; match values based on, for example, a sum
(preferably the logarithmic sum) of end time
probabilities; start times based on the previously generated end
time probabilities; and word match scores based on the
match values for sequential phones in a word. In
addition, the detailed match preferably accounts for "tail
probabilities" in the matching procedure. A tail
probability measures the likelihood of successive labels
without regard to words. In a simple embodiment, a given
tail probability corresponds to the likelihood of a
label following another label. This likelihood is readily
determined from strings of labels generated by, for
example, some sample speech.
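
For the simple embodiment just mentioned, tail probabilities
could be estimated by counting successive label pairs in sample
label strings. The Python sketch below is one such estimate
under that assumption; the relative-frequency estimator and the
function name are illustrative choices and are not taken from
the present implementation.

```python
from collections import Counter


def tail_probabilities(label_strings):
    """Estimate Pr(next label | current label) from sample label strings."""
    pair_counts = Counter()
    first_counts = Counter()
    for labels in label_strings:
        for current, following in zip(labels, labels[1:]):
            pair_counts[(current, following)] += 1
            first_counts[current] += 1
    return {pair: count / first_counts[pair[0]]
            for pair, count in pair_counts.items()}


tails = tail_probabilities([[1, 2, 1, 2, 3]])
print(tails)
```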

Hence, the detailed match provides sufficient storage
to contain baseforms, statistics for the Markov models,
and tail probabilities. For a 5000 word vocabulary where
each word comprises approximately ten phones, the
baseforms have a memory requirement of 5000 x 10. Where
there are 70 distinct phones (with a Markov model for
each phone) and 200 distinct labels and ten transitions
at which any label has a probability of being produced,
the statistics would require 70 x 10 x 200 locations.
However, it is preferred that the phone machines are
divided into three portions --a start portion, a middle
portion, and an end portion-- with statistics
corresponding thereto. (The three self-loops are preferably
included in successive portions.) Accordingly, the
storage requirements are reduced to 70 x 3 x 200. With
regard to the tail probabilities, 200 x 200 storage
locations are needed. In this arrangement, 50K integer and
62K floating point storage performs satisfactorily.
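
The products quoted above work out as follows. The Python
fragment only multiplies the stated figures; the variable names
are illustrative.

```python
VOCABULARY_WORDS = 5000
PHONES_PER_WORD = 10
DISTINCT_PHONES = 70
DISTINCT_LABELS = 200
LABEL_TRANSITIONS = 10   # transitions at which a label can be produced
PHONE_PORTIONS = 3       # start, middle, and end portions

baseform_entries = VOCABULARY_WORDS * PHONES_PER_WORD
full_statistics = DISTINCT_PHONES * LABEL_TRANSITIONS * DISTINCT_LABELS
reduced_statistics = DISTINCT_PHONES * PHONE_PORTIONS * DISTINCT_LABELS
tail_entries = DISTINCT_LABELS * DISTINCT_LABELS

print(baseform_entries, full_statistics, reduced_statistics, tail_entries)
```

Dividing each phone machine into three portions thus reduces
the label statistics from 140,000 to 42,000 locations.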

It should be noted that the detailed match may be
implemented by using fenemic, rather than phonetic,
phones. Appendix 1 provides a program listing that
corresponds to the main computational kernel of a fenemic
detailed match. The routine in Appendix 1 extends a
lattice --which corresponds to a fenemic baseform of a
current word-- forward in time by a single time step.
The subroutine EXTLOOP is the main loop. Therebefore,
the pipeline is started up and partial computations
needed for the main loop are performed. After the main
loop, partial computations remaining in the
computational pipeline are emptied.

D. Basic Fast Match 

Because the detailed match is computationally expensive,
a basic fast match and an alternative fast match which
reduce the computation requirements with some moderate
sacrifice in accuracy are provided. The fast match is
preferably used in conjunction with the detailed
match, the fast match listing likely candidate words
from the vocabulary, and the detailed match being
performed on, at most, the candidate words on the list.

A fast approximate acoustic matching technique is the 
subject of the co-pending patent application entitled 
"Apparatus and Method of Performing Acoustic Matching". 
In the fast approximate acoustic match, preferably each 
phone machine is simplified by replacing the actual la- 
bel probability for each label at all transitions in a 
given phone machine with a specific replacement value. 
The specific replacement value is preferably selected 
so that the match value for a given phone when the re- 
placement values are used is an overestimation of the 
match value achieved by the detailed match when the re- 
placement values do not replace the actual label proba- 
bilities. One way of assuring this condition is by 
selecting each replacement value so that no probability 
corresponding to a given label in a given phone machine
is greater than the replacement value thereof. By sub- 
stituting the actual label probabilities in a phone ma- 
chine with corresponding replacement values, the number 
of required computations in determining a match score 
for a word is reduced greatly. Moreover, since the re- 
placement value is preferably an overestimation, the 
resulting match score is not less than would have pre- 
viously been determined without the replacement. 

In a specific embodiment of performing an acoustic match
in a linguistic decoder with Markov models, each phone
machine therein is characterized --by training-- to have
(a) a plurality of states and transition paths between
states, (b) transitions tr(Sj|Si) having probabilities
T(i->j) each of which represents the probability of a
transition to a state Sj given a current state Si where
Si and Sj may be the same state or different states, and
(c) actual label probabilities wherein each actual label
probability p(yk|i->j) indicates the probability that a
label yk is produced by a given phone machine at a given
transition from one state to a subsequent state where k
is a label identifying notation; each phone machine
including (a) means for assigning to each yk in said each
phone machine a single specific value p'(yk) and (b)
means for replacing each actual output probability
p(yk|i->j) at each transition in a given phone machine
by the single specific value p'(yk) assigned to the
corresponding yk. Preferably, the replacement value is
at least as great as the maximum actual label
probability for the corresponding yk label at any transition in
a particular phone machine. The fast match embodiments
are employed to define a list of on the order of ten to
one hundred candidate words selected as the most likely
words in the vocabulary to correspond to the incoming
labels. The candidate words are preferably subjected to
the language model and to the detailed match. By paring
the number of words considered by the detailed match to
on the order of 1% of the words in the vocabulary, the
computational cost is greatly reduced while accuracy is
maintained.






The basic fast match simplifies the detailed match by
replacing with a single value the actual label
probabilities for a given label at all transitions at which
the given label may be generated in a given phone
machine. That is, regardless of the transition in a given
phone machine whereat a label has a probability of
occurring, the probability is replaced by a single
specific value. The value is preferably an overestimate,
being at least as great as the largest probability of
the label occurring at any transition in the given phone
machine.

By setting the label probability replacement value as
the maximum of the actual label probabilities for the
given label in the given phone machine, it is assured
that the match value generated with the basic fast match
is at least as high as the match value that would result
from employing the detailed match. In this way, the
basic fast match typically overestimates the match value
of each phone so that more words are generally selected
as candidate words. Words considered candidates
according to the detailed match also pass muster in accordance
with the basic fast match.
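
Choosing the replacement value as the maximum can be sketched
in Python as follows. The data layout (one dictionary of actual
label probabilities per non-null transition of a single phone
machine) and the function name are assumptions made for the
illustration.

```python
def replacement_values(transition_label_probs):
    """For one phone machine, map each label to the largest actual label
    probability it has at any non-null transition."""
    replacements = {}
    for label_probs in transition_label_probs:
        for label, prob in label_probs.items():
            replacements[label] = max(replacements.get(label, 0.0), prob)
    return replacements


# Hypothetical phone machine with three label-producing transitions.
phone_transitions = [
    {"a": 0.5, "b": 0.5},
    {"a": 0.9, "b": 0.1},
    {"a": 0.2, "b": 0.8},
]
p_prime = replacement_values(phone_transitions)
print(p_prime)
```

Because each replacement value is at least as large as every
actual label probability it stands in for, the resulting match
value cannot fall below that of the detailed match.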

Referring to FIG. 12, a phone machine 3000 for the basic
fast match is illustrated. Labels (also referred to as
symbols and fenemes) enter the basic fast match phone
machine 3000 together with a start-time distribution.
The start-time distribution and the label string input
are like those entering the detailed match phone machine
described hereinabove. It should be realized that the
start time may, on occasion, not be a distribution over
a plurality of times but may, instead, represent a
precise time --for example following an interval of
silence-- at which the phone begins. When speech is
continuous, however, the end-time distribution is used
to define the start-time distribution (as is discussed
in greater detail hereinbelow). The phone machine 3000
generates an end-time distribution and a match value for
the particular phone from the generated end-time
distribution. The match score for a word is defined as the
sum of match values for component phones --at least the
first h phones in the word.

Referring now to FIG. 13, a diagram of a basic fast match
computation is illustrated. The basic fast match
computation is only concerned with the start-time
distribution, the number --or length-- of labels produced by
the phone, and the replacement values p'(yk) associated
with each label yk. By substituting all actual label
probabilities for a given label in a given phone machine
by a corresponding replacement value, the basic fast
match replaces transition probabilities with length
distribution probabilities and obviates the need for
including actual label probabilities (which can differ
for each transition in a given phone machine) and
probabilities of being at a given state at a given time.

In this regard, the length distributions are determined 
from the detailed match model. Specifically, for each 
length in the length distribution, the procedure
preferably examines each state individually and determines
for each state the various transition paths by which the
currently examined state can occur (a) given a
particular label length and (b) regardless of the outputs
along the transitions. The probabilities for all
transition paths of the particular length to each subject
state are summed and the sums for all the subject states
are then added to indicate the probability of a given
length in the distribution. The above procedure is
repeated for each length. In accordance with the preferred
form of the matching procedure, these computations are 
made with reference to a trellis diagram as is known in 
the art of Markov modelling. For transition paths which 
share branches along the trellis structure, the compu- 
tation for each common branch need be made only once and 
is applied to each path that includes the common branch. 

In the diagram of FIG. 13, two limitations are included
by way of example. First, it is assumed that the length
of labels produced by the phone can be zero, one, two,
or three having respective probabilities of l0, l1, l2,
and l3. The start time is also limited, permitting only
four start times having respective probabilities of q0,
q1, q2, and q3. With these limitations, the following
equations define the end-time distribution of a subject
phone as:

Φ0 = q0 l0
Φ1 = q1 l0 + q0 l1 p1
Φ2 = q2 l0 + q1 l1 p2 + q0 l2 p1 p2
Φ3 = q3 l0 + q2 l1 p3 + q1 l2 p2 p3 + q0 l3 p1 p2 p3
Φ4 = q3 l1 p4 + q2 l2 p3 p4 + q1 l3 p2 p3 p4
Φ5 = q3 l2 p4 p5 + q2 l3 p3 p4 p5
Φ6 = q3 l3 p4 p5 p6

In examining the equations, it is observed that Φ3
includes a term corresponding to each of four start times.
The first term represents the probability that the phone
starts at time t=t3 and produces a length of zero labels
--the phone starting and ending at the same time. The
second term represents the probability that the phone
starts at time t=t2, that the length of labels is one,
and that a label 3 is produced by the phone. The third
term represents the probability that the phone starts
at time t=t1, that the length of labels is two (namely
labels 2 and 3), and that labels 2 and 3 are produced
by the phone. Similarly, the fourth term represents the
probability that the phone starts at time t=t0; that the
length of labels is three; and that the three labels 1,
2, and 3 are produced by the phone.

Comparing the computations required in the basic fast
match with those required by the detailed match suggests
the relative simplicity of the former relative to the
latter. In this regard, it is noted that the p' value
remains the same for each appearance in all the
equations as do the label length probabilities.
Moreover, with the length and start time limitations, the
computations for the later end times become simpler. For
example, at Φ6, the phone must start at time t=t3 and
all three labels 4, 5, and 6 must be produced by the
phone for that end time to apply.
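
The basic fast match end-time computation can be sketched in
Python as below. The sketch assumes a start-time distribution
q, a length distribution l, and a list p in which p[j] holds
the replacement value of the (j+1)-th label of the string; the
numeric values and the function name are illustrative only.

```python
import math


def basic_fast_match(q, l, p):
    """End-time distribution: entry m sums q[start] * l[length] times the
    product of replacement values of the labels produced, over all
    start + length = m."""
    phi = [0.0] * (len(q) + len(l) - 1)
    for start, q_start in enumerate(q):
        product = 1.0
        for length, l_length in enumerate(l):
            if length > 0:
                # The phone produces label number start + length (1-based).
                product *= p[start + length - 1]
            phi[start + length] += q_start * l_length * product
    return phi


q = [0.4, 0.3, 0.2, 0.1]   # four permitted start times
l = [0.1, 0.2, 0.3, 0.4]   # lengths zero through three
p = [0.5] * 6              # replacement values of labels 1 through 6
phi = basic_fast_match(q, l, p)
match_value = math.log10(sum(phi))
print(phi, match_value)
```

With four start times and four lengths the result has seven
end times, the last of which has a single contributing term.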

In generating a match value for a subject phone, the end
time probabilities along the defined end-time
distribution are summed. If desired, the log of the sum is taken
to provide the expression:

match value = log10(Φ0 + ... + Φ6)

As noted previously, a match score for a word is readily
determined by summing the match values for successive
phones in a particular word.

In describing the generating of the start time
distribution, reference is made to FIG. 14. In FIG. 14(a), the
word THE1 is repeated, broken down into its component
phones. In FIG. 14(b), the string of labels is depicted
over time. In FIG. 14(c), a first start-time distribution
is shown. The first start-time distribution has been
derived from the end-time distribution of the most
recent previous phone (in the previous word which may
include a "word" of silence). Based on the label inputs
and the start-time distribution of FIG. 14(c), the
end-time distribution for the phone DH, Φ_DH, is generated.
The start-time distribution for the next phone, UH, is
determined by recognizing the time during which the
previous phone end-time distribution exceeded a
threshold (A) in FIG. 14(d). (A) is determined individually for
each end-time distribution. Preferably, (A) is a
function of the sum of the end-time distribution values for
a subject phone. The interval between times a and b thus
represents the time during which the start-time
distribution for the phone UH is set. (See FIG. 14(e).) The
interval between times c and d in FIG. 14(e) corresponds
to the times between which the end-time distribution for
the phone DH exceeds the threshold (A) and between which
the start-time distribution of the next phone is set.
The values of the start-time distribution are obtained
by normalizing the end-time distribution by, for
example, dividing each end-time value by the sum of the
end-time values which exceed the threshold (A).
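
A Python sketch of this thresholding and normalizing step
follows. The text states only that (A) is preferably a function
of the sum of the end-time values; taking (A) as a fixed
fraction of that sum, the default fraction, and the function
name are assumptions made for the illustration.

```python
def start_time_distribution(end_dist, threshold_fraction=0.05):
    """Keep end-time values above threshold (A) and scale them to sum to one."""
    threshold_a = threshold_fraction * sum(end_dist)  # hypothetical choice of (A)
    kept = [value if value > threshold_a else 0.0 for value in end_dist]
    kept_sum = sum(kept)
    if kept_sum <= 0.0:
        raise ValueError("no end-time value exceeds the threshold")
    return [value / kept_sum for value in kept]


next_start = start_time_distribution([0.01, 0.2, 0.5, 0.28, 0.01])
print(next_start)
```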

The basic fast match phone machine 3000 has been
implemented in a Floating Point Systems Inc. 190L with an
APAL program. Other hardware and software may also be
used to develop a specific form of the matching
procedure by following the teachings set forth herein.

E. Alternative Fast Match 

The basic fast match employed alone or, preferably, in
conjunction with the detailed match and/or a language
model greatly reduces computation requirements. To
further reduce computational requirements, the present
teachings further simplify the detailed match by
defining a uniform label length distribution between two
lengths --a minimum length L_min and a maximum length
L_max. In the basic fast match, the probabilities of a
phone generating labels of a given length --namely l0,
l1, l2, and so on-- typically have differing values.
According to the alternative fast match, the probability for
each length of labels is replaced by a single uniform
value.
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Preferably, the raiTiimum length is equal to the sooallest 
length having a nonzero probability in the original 
length distribution, although other lengths may be se- 
lected if desired. The selection of the maximum length 
is more arbitrary than the selection of the minimum 
length, but is significant in that the probability of 
lengths less than the minimum and greater than the max- 
imum are set as zero. By defining the length probability 
to exist between only the minimum length and the maximum 
-"length, a uniform pseudo-distribution can be set forth. 
In one approach, the uniform probability can be set as 
the average probability over the pseudo -distribution. 
Alternatively, the uniform probability can be set as the 
maximtim of the length probabilities that are replaced 
by the uniform value. 
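
By way of illustration, the two ways of choosing the uniform value may be sketched as below. The function name, the `mode` argument, and the sample distribution are assumptions introduced for the example; the default maximum length used here (the largest length with nonzero probability) is likewise only one possible choice.

```python
def uniform_length_distribution(length_probs, l_max=None, mode="average"):
    """Replace a label length distribution by a uniform pseudo-distribution."""
    nonzero = [n for n, p in enumerate(length_probs) if p > 0.0]
    if not nonzero:
        raise ValueError("length distribution has no nonzero entry")
    # Minimum length: smallest length having a nonzero probability.
    l_min = nonzero[0]
    # Assumed default for the maximum length: largest length with nonzero probability.
    if l_max is None:
        l_max = nonzero[-1]
    replaced = length_probs[l_min:l_max + 1]
    if mode == "average":
        value = sum(replaced) / len(replaced)
    elif mode == "max":
        value = max(replaced)
    else:
        raise ValueError("mode must be 'average' or 'max'")
    # Lengths below the minimum or above the maximum are set to zero.
    uniform = [value if l_min <= n <= l_max else 0.0
               for n in range(len(length_probs))]
    return uniform, l_min, l_max


original = [0.0, 0.1, 0.4, 0.3, 0.2, 0.0]
avg_dist, l_min, l_max = uniform_length_distribution(original)
max_dist, _, _ = uniform_length_distribution(original, mode="max")
```

With the "max" choice the replaced values no longer sum to one, which is consistent with the text calling the result a pseudo-distribution.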

The effect of characterizing all the label length prob- 
abilities as equal is readily observed with reference 
to the equations set forth above for the end-time dis- 
tribution in the basic fast match. Specifically, the 
length probabilities can be factored out as a constant. 

With $L_{min}$ being set at zero and all length probabilities
being replaced by a single constant value, the end-time
distribution can be characterized as:

$\theta_m = \Phi_m / l = q_m + \theta_{m-1} p_m$

where "l" is the single uniform replacement value and
where the value for $p_m$ corresponds preferably to the
replacement value for a given label being generated in
the given phone at time m.

For the above equation for $\theta_m$, the match value is defined
as:

match value = $\log_{10}(\theta_0 + \theta_1 + \cdots + \theta_m) + \log_{10}(l)$

In comparing the basic fast match and the alternative
fast match, it has been found that the number of required
additions and multiplications are greatly reduced by
employing the alternative fast match phone machines.
With $L_{min} = 0$, it has been found that the basic fast
match requires forty multiplications and twenty additions
in that the length probabilities must be considered.
With the alternative fast match, $\theta_m$ is determined
recursively and requires one multiplication and one
addition for each successive $\theta_m$.
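
The recursive determination, one multiplication and one addition per end time, can be sketched as follows for the case of a minimum length of zero. The sketch assumes the recursion theta_m = q_m + theta_(m-1) * p_m with the value before time zero taken as zero, and a match value of log10 of the summed theta values plus log10 of the uniform value l; the function name and the sample numbers are illustrative only.

```python
import math


def alternative_fast_match(q, p, l):
    """Return the end-time values theta_m and the match value for one phone.

    q[m]: start-time distribution value at time m
    p[m]: replacement label probability at time m
    l:    single uniform length probability
    """
    theta_prev = 0.0  # assumption: no probability mass before time 0
    thetas = []
    total = 0.0
    for q_m, p_m in zip(q, p):
        # One multiplication and one addition for each successive theta_m.
        theta = q_m + theta_prev * p_m
        thetas.append(theta)
        total += theta
        theta_prev = theta
    # The uniform length probability is factored out as a constant.
    match_value = math.log10(total) + math.log10(l)
    return thetas, match_value


thetas, value = alternative_fast_match(
    q=[0.5, 0.5, 0.0, 0.0], p=[1.0, 0.5, 0.5, 0.5], l=0.25
)
```

Once the start-time values q_m are zero, each further theta_m needs only the single multiplication, which matches the operation counts stated in the text.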

To further illustrate how the alternative fast match
simplifies computations, FIG. 15 and FIG. 16 are provided.
In FIG. 15(a), a phone machine embodiment 3100 corresponding
to a minimum length $L_{min} = 0$ is depicted. The
maximum length is assumed to be infinite so that the
length distribution may be characterized as uniform.
In FIG. 15(b), the trellis diagram resulting from the
phone machine 3100 is shown. Assuming that start times
after $q_n$ are outside the start-time distribution, all
determinations of each successive $\theta_m$ with m<n require
one addition and one multiplication. For determinations
of end times thereafter, there is only one required
multiplication and no additions.

In FIG. 16, $L_{min} = 4$. FIG. 16(a) illustrates a specific
embodiment of a phone machine 3200 therefor and
FIG. 16(b) shows a corresponding trellis diagram. Because
$L_{min} = 4$, the trellis diagram of FIG. 16(b) has a
zero probability along the paths marked u, v, w, and z.
For those end times which extend between $\theta_4$ and $\theta_n$, it
is noted that four multiplications and one addition are
required. For end times greater than n+4, one multiplication
and no additions are required. This embodiment
has been implemented in APAL code on a FPS 190L.



In Appendix 2, a program listing corresponding to the 
main computational kernel of the fast (approximate) 
match is provided. The code corresponds to the case 

It should be noted that additional states may be added 
to the FIG. 15 or FIG. 16 embodiments as desired. 



F. Matching Based On First J Labels 

As a further refinement to the basic fast match and al- 
ternative fast match, it is contemplated that only the 
first J labels of a string which enters a phone machine 
be considered in the match. Assuming that labels are
produced by the acoustic processor of an acoustic channel
at the rate of one per centisecond, a reasonable
value for J is one hundred. In other words, labels cor- 
responding to on the order of one second of speech will 
be provided to determine a match between a phone and the 
labels entering the phone machine. By limiting the num- 
ber of labels examined, two advantages are realized.
First, decoding delay is reduced and, second, problems
in comparing the scores of short words with long words
are substantially avoided. The length of J can, of 
course, be varied as desired. 

The effect of limiting the number of labels examined can
be noted with reference to the trellis diagram of
FIG. 16(b). Without the present refinement, the fast
match score is the sum of the probabilities of $\theta_m$'s along
the bottom row of the diagram. That is, the probability
of being at state $S_4$ at each time starting at $t = t_0$ (for
$L_{min} = 0$) or $t = t_4$ (for $L_{min} = 4$) is determined as a $\theta_m$ and
all $\theta_m$'s are then totalled. For $L_{min} = 4$, there is no
probability of being in state $S_4$ at any time before $t_4$.
With the refinement, the summing of $\theta_m$'s terminates at
time J. In FIG. 16(b), time J corresponds to time t

Terminating the examination of J labels over J time intervals
can result in the following two probability
summations in determining a match score. First, as described
hereinbefore, there is a row calculation along
the bottom row of the trellis diagram but only up to the
time J-1. The probabilities of being in state $S_4$ at each
time up to time J-1 are summed to form a row score.
Second, there is a column score which corresponds to the
sum of probabilities that the phone is at each respective
state $S_0$ through $S_4$ at time J. That is, the column
score is:

column score = $\sum_{f=0}^{4} \Pr(S_f, J)$

The match score for a phone is obtained by summing the
row score and column score and then taking the logarithm
of that sum. To continue the fast match for the next
phone, the values along the bottom row --preferably including
time J-- are used to derive the next phone
start-time distribution.

After determining a match score for each of b consecutive
phones, the total for all phones is, as before
noted, the sum of the match scores for all the phones.

In examining the manner in which the end-time probabilities
are generated in the basic fast match and alternative
fast match embodiments set forth above, it is
noted that the determination of column scores does not
conform readily to the fast match computations. To better
adapt the refinement of limiting the number of labels
examined to the fast match and alternative match,
the present matching technique provides that the column
score be replaced by an additional row score. That is,
an additional row score is determined for the phone being
at state $S_4$ (in FIG. 16(b)) between times J and J+K
where K is the maximum number of states in any phone
machine. Hence, if any phone machine has ten states, the
present refinement adds ten end times along the bottom
row of the trellis for each of which a probability is
determined. All the probabilities along the bottom row
up to and including the probability at time J+K are added
to produce a match score for the given phone. As before,
consecutive phone match values are summed to provide a
word match score.

This embodiment has been implemented in APAL code on a
FPS 190L; however, as with other portions of the system, it
may be implemented with other codes on other hardware.
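
The replacement of the column score by an additional row score may be sketched as follows. The sketch assumes that the bottom-row probabilities have already been computed and are supplied as a list indexed by time, and that the logarithm of the sum is taken as in the preceding passages; the function names and sample values are illustrative and do not come from the APAL implementation.

```python
import math


def phone_match_score(bottom_row, J, K):
    """Sum bottom-row probabilities up to and including time J+K, then take log10.

    bottom_row[t]: probability of the phone being in its final state at time t
    J:             number of labels examined
    K:             maximum number of states in any phone machine
    """
    last = min(J + K, len(bottom_row) - 1)
    # Assumes at least one summed probability is nonzero; log10(0) is undefined.
    return math.log10(sum(bottom_row[: last + 1]))


def word_match_score(phone_scores):
    """Consecutive phone match values are summed to give the word match score."""
    return sum(phone_scores)


row = [0.5, 0.25, 0.125, 0.0625, 0.03125]
score = phone_match_score(row, J=1, K=2)
```

Because only bottom-row values are used, the same recursion that produces the end-time values also produces every term of the score.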

G. Phone Tree Structure and Fast Match Embodiments

By employing the basic fast match or alternative fast
match --with or without the maximum label limitation--
the computational time required in determining phone
match values is tremendously reduced. In addition, the
computational savings remain high even when the detailed
match is performed on the words in the fast match derived
list.

The phone match values, once determined, are compared
along the branches of a tree structure 4100 as shown in
FIG. 17 to determine which paths of phones are most
probable. In FIG. 17, the phone match values for DH and
UH1 (emanating from point 4102 to branch 4104) should
sum to a much higher value for the spoken word "the" than
the various sequences of phones branching from the phone
MX. In this regard, it should be observed that the phone
match value of the first MX phone is computed only once
and is then used for each baseform extending therefrom.
(See branches 4104 and 4106.) In addition, when the
total score calculated along a first sequence of
branches is found to be much lower than a threshold value
or much lower than the total score for other sequences
of branches, all baseforms extending from the first sequence
may be simultaneously eliminated as candidate
words. For example, baseforms associated with branches
4108 through 4118 are simultaneously discarded when it
is determined that MX is not a likely path.

With the fast match embodiments and the tree structure, 
an ordered list of candidate words is generated with 
great computational savings. 
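
The sharing of phone match values along common branches, and the simultaneous elimination of all baseforms below an unlikely branch, can be sketched with a simple prefix tree. Everything in this sketch is assumed for illustration: the dictionary-based tree, the baseform spellings other than the phones DH, UH1, and MX named in the text, the log scores, and the threshold.

```python
def build_tree(baseforms):
    """Arrange word baseforms (phone sequences) in a tree with shared prefixes."""
    root = {"children": {}, "words": []}
    for word, phones in baseforms.items():
        node = root
        for phone in phones:
            node = node["children"].setdefault(phone, {"children": {}, "words": []})
        node["words"].append(word)
    return root


def candidate_words(node, phone_score, threshold, total=0.0, path=()):
    """Walk the tree, scoring each branch once and pruning unlikely branches."""
    found = [(word, total) for word in node["words"]]
    for phone, child in node["children"].items():
        branch = path + (phone,)
        # Computed once, then reused by every baseform extending from this branch.
        score = total + phone_score(branch)
        if score < threshold:
            continue  # all baseforms below this branch are discarded together
        found.extend(candidate_words(child, phone_score, threshold, score, branch))
    return found


# Hypothetical baseforms and log match values, for illustration only.
baseforms = {"the": ["DH", "UH1"], "them": ["DH", "EH", "M"],
             "me": ["MX", "IY"], "my": ["MX", "AY"]}
log_scores = {"DH": -1.0, "UH1": -0.5, "EH": -3.0, "M": -0.5,
              "MX": -9.0, "IY": -0.5, "AY": -0.5}
scored = []


def phone_score(branch):
    scored.append(branch)  # record which branches were actually scored
    return log_scores[branch[-1]]


tree = build_tree(baseforms)
ranked = sorted(candidate_words(tree, phone_score, threshold=-5.0),
                key=lambda item: -item[1])
```

In this toy run the MX branch falls below the threshold, so the phones following MX are never scored and both words beginning with MX drop out together.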

With regard to storage requirements, it is noted that
the tree structure of phones, the statistics for the
phones, and tail probabilities are to be stored. With
regard to the tree structure, there are 25000 arcs and
four datawords characterizing each arc. The first
dataword represents an index to successor arcs or
phones. The second dataword indicates the number of
successor phones along the branch. The third dataword
indicates at which node in the tree the arc is located.
And the fourth dataword indicates the current phone.
Hence, for the tree structure, 25000 x 4 storage spaces
are required. In the fast match, there are 100 distinct
phones and 200 distinct fenemes. In that a feneme has
a single probability of being produced anywhere in a
phone, storage for 100 x 200 statistical probabilities
is required. Finally, for the tail probabilities, 200
x 200 storage spaces are required. 100K integer and 60K
floating point storage is sufficient for the fast match.
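
The storage figures can be checked with a few lines of arithmetic. The variable names are illustrative, and the sketch assumes that the tree datawords are the integer storage and that the label and tail probabilities are the floating point storage.

```python
ARCS = 25_000          # arcs in the phone tree
DATAWORDS_PER_ARC = 4  # successor index, successor count, node, current phone
PHONES = 100           # distinct phones in the fast match
FENEMES = 200          # distinct fenemes (labels)

# Integer storage: four datawords for each arc of the tree.
tree_datawords = ARCS * DATAWORDS_PER_ARC

# Floating point storage: one label probability per (phone, feneme) pair,
# plus the 200 x 200 tail probabilities.
label_probabilities = PHONES * FENEMES
tail_probabilities = FENEMES * FENEMES
float_values = label_probabilities + tail_probabilities
```

The totals come to 100,000 integers and 60,000 floating point values, which agrees with the 100K and 60K figures stated above.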



H. Language Model

As noted previously, a language model which stores information
--such as tri-grams-- relating to words in
context may be included to enhance the probability of a
correct word selection. Language models have been reported
in the literature.

The language model 1010, preferably, has a unique character.
Specifically, a modified tri-gram method is
used. In accordance with this method, a sample text is
examined to determine the likelihood of each ordered
triplet of words, ordered pair of words, or single words
in the vocabulary. A list of the most likely triplets
of words and a list of the most likely pairs of words
are formed. Moreover, the likelihood of a triplet not
being in the triplet list and the likelihood of a pair
not being in the pair list are respectively determined.

In accordance with the language model, when a subject
word follows two words, a determination is made as to
whether the subject word and the two preceding words are
on the triplet list. If so, the stored probability assigned
to the triplet is indicated. If the subject word
and its two predecessors are not on the triplet list, a
determination is made as to whether the subject word and
its adjacent predecessor are on the pair list. If so,
the probability of the pair is multiplied by the probability
of a triplet not being on the triplet list, the
product then being assigned to the subject word. If the
subject word and its predecessor(s) are not on the
triplet list or pair list, the probability of the subject
word alone is multiplied by the likelihood of a
triplet not being on the triplet list and by the probability
of a pair not being on the pair list. The product
is then assigned to the subject word.
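
The three-level lookup described above can be sketched as follows. The function name, the dictionary representation of the triplet, pair, and single-word lists, and all probabilities shown are assumptions chosen for the example.

```python
def word_probability(w1, w2, w3, triplets, pairs, singles,
                     p_no_triplet, p_no_pair):
    """Probability assigned to subject word w3 following the words w1 and w2."""
    # Case 1: the three words are on the triplet list.
    if (w1, w2, w3) in triplets:
        return triplets[(w1, w2, w3)]
    # Case 2: only the subject word and its adjacent predecessor are listed.
    if (w2, w3) in pairs:
        return pairs[(w2, w3)] * p_no_triplet
    # Case 3: neither list applies; use the probability of the word alone.
    return singles[w3] * p_no_triplet * p_no_pair


# Hypothetical lists and probabilities, for illustration only.
triplets = {("to", "be", "or"): 0.5}
pairs = {("be", "or"): 0.25}
singles = {"or": 0.125}

from_triplet = word_probability("to", "be", "or", triplets, pairs, singles, 0.5, 0.25)
from_pair = word_probability("we", "be", "or", triplets, pairs, singles, 0.5, 0.25)
from_single = word_probability("we", "are", "or", triplets, pairs, singles, 0.5, 0.25)
```

Each step down the lists multiplies in the likelihood of having missed the list above it, so a word found only by itself receives the smallest of the three values.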

Referring to FIG. 18, a flowchart 5000 illustrating the
training of phone machines employed in acoustic matching
is shown. At step 5002, a vocabulary of words --typically
on the order of 5000 words-- is defined. Each word
is then represented by a sequence of phone machines. The
phone machines have, by way of example, been shown
as phonetic-type phone machines but may, alternatively,
comprise a sequence of fenemic phones. Representing
words by a sequence of phonetic-type phone machines or
by a sequence of fenemic phone machines is discussed
hereinbelow. A phone machine sequence for a word is referred
to as a word baseform.

In step 5006, the word baseforms are arranged in the tree 
structure described hereinabove. The statistics for each 
phone machine in each word baseform are determined by 
training according to the well-known forward-backward 
algorithm set forth in the article "Continuous Speech 
Recognition by Statistical Methods" by F. Jelinek. 

At step 5009, values to be substituted for actual parameter
values or statistics used in the detailed match
are determined. For example, the values to be substituted
for the actual label output probabilities are determined.
In step 5010, the determined values replace
the stored actual probabilities so that the phones in
each word baseform include the approximate substitute
values. All approximations relating to the basic fast
match are performed in step 5010.

A decision is then made as to whether the acoustic
matching is to be enhanced (step 5011). If not, the
values determined for the basic approximate match are
set for use and other estimations relating to other approximations
are not set (step 5012). If enhancement is
desired, step 5018 is followed. A uniform string length
distribution is defined (step 5018) and a decision is
made as to whether further enhancement is desired (step
5020). If not, label output probability values and
string length probability values are approximated and
set for use in the acoustic matching. If further enhancement
is desired, acoustic matching is limited to
the first J labels in the generated string (step 5022).
Whether or not one of the enhanced embodiments is selected,
the parameter values determined are set in step
5012, whereupon each phone machine in each word baseform
has been trained with the desired approximations that
enable the fast approximate matching.

J. Stack Decoder 

A preferred stack decoder used in the speech recognition
system of FIG. 1 has been invented by L. Bahl, F. Jelinek,
and R.L. Mercer of the IBM Speech Recognition Group. The
preferred stack decoder is now described.

In FIG. 19 and FIG. 20, a plurality of successive labels
$y_1 y_2 \ldots$ are shown generated at successive "label intervals",
or "label positions".

Also shown in FIG. 20 are a plurality of some generated
word paths, namely path A, path B, and path C. In the
context of FIG. 19, path A could correspond to the entry
"to be or", path B to the entry "two b", and path C to
the entry "too". For a subject word path, there is a
label (or equivalently a label interval) at which the
subject word path has the highest probability of having
ended --such label being referred to as a "boundary label".

For a word path W representing a sequence of words, a
most likely end time --represented in the label string
as a "boundary label" between two words-- can be found
by known methods such as that described in an article
entitled "Faster Acoustic Match Computation" (by L.R.
Bahl, F. Jelinek, and R.L. Mercer) in the IBM Technical
Disclosure Bulletin, volume 23, number 4, September 1980.
Briefly, the article discusses methodology for addressing
two similar concerns: (a) how much of a label string
Y is accounted for by a word (or word sequence) and (b)
at which label interval does a partial sentence --corresponding
to a part of the label string-- end.

For any given word path, there is a "likelihood value"
associated with each label or label interval, including
the first label of the label string through to the
boundary label. Taken together, all of the likelihood
values for a given word path represent a "likelihood
vector" for the given word path. Accordingly, for each
word path there is a corresponding likelihood vector.
Likelihood values are illustrated in FIG. 20.

A "likelihood envelope" at a label interval t for a
collection of word paths $W^1, W^2, \ldots, W^s$ is defined mathematically
as:

$\Lambda_t = \max(L_t(W^1), \ldots, L_t(W^s))$

That is, for each label interval, the likelihood envel- 
ope includes the highest likelihood value associated 
with any word path in the collection. A likelihood en- 
velope 1040 is illustrated in FIG. 20. 
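
Computing the envelope amounts to an interval-by-interval maximum over the likelihood vectors, as in the following sketch. The list representation of a likelihood vector (one value per label interval up to the path's boundary label), the function name, and the sample values are assumptions for illustration.

```python
NEG_INF = float("-inf")


def likelihood_envelope(likelihood_vectors, num_intervals):
    """Highest likelihood value at each label interval over a collection of word paths."""
    envelope = [NEG_INF] * num_intervals
    for vector in likelihood_vectors:
        # A vector only covers intervals up to the boundary label of its path.
        for t, value in enumerate(vector):
            if value > envelope[t]:
                envelope[t] = value
    return envelope


# Hypothetical log likelihood vectors for three word paths.
vectors = [
    [-1.0, -2.0, -3.0, -4.0],
    [-0.5, -2.5],
    [-1.5, -1.0, -3.5],
]
envelope = likelihood_envelope(vectors, num_intervals=5)
```

Intervals that no word path has reached keep the initial value of negative infinity.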

A word path is considered "complete" if it corresponds
to a complete sentence. A complete path is preferably
identified by a speaker entering an input, e.g. pressing
a button, when he reaches the end of a sentence. The
entered input is synchronized with a label interval to
mark a sentence end. A complete word path cannot be extended
by appending any words thereto. A "partial" word
path corresponds to an incomplete sentence and can be
extended.

Partial paths are classified as "live" or "dead". A word
path is "dead" if it has already been extended and "live"
if it has not. With this classification, a path which
has already been extended to form one or more longer
extended word paths is not reconsidered for extension
at a subsequent time.

Each word path is also characterizable as "good" or
"bad" relative to the likelihood envelope. The word path
is good if, at the label corresponding to the boundary
label thereof, the word path has a likelihood value
which is within $\Delta$ of the maximum likelihood envelope.
Otherwise the word path is marked as "bad". Preferably,
but not necessarily, $\Delta$ is a fixed value by which each
value of the maximum likelihood envelope is reduced to
serve as a good/bad threshold level.

For each label interval there is a stack element. Each 
live word path is assigned to the stack element corre- 
sponding to the label interval that corresponds to the 
boundary label of such a live path. A stack element may 
have zero, one, or more word path entries --the entries 
being listed in order of likelihood value. 

The steps performed by the stack decoder 1002 of FIG. 1
are now discussed.

Forming the likelihood envelope and determining which
word paths are "good" are interrelated as suggested by
the sample flowchart of FIG. 21.

In the flowchart of FIG. 21, a null path is first entered
into the first stack(0) in step 5050. A stack(complete)
element is provided which contains complete paths, if
any, which have been previously determined (step 5052).
Each complete path in the stack(complete) element has a
likelihood vector associated therewith. The likelihood
vector of the complete path having the highest likelihood
at the boundary label thereof initially defines the
maximum likelihood envelope. If there is no complete
path in the stack(complete) element, the maximum likelihood
envelope is initialized as $-\infty$ at each label interval.
Moreover, if complete paths are not specified,
the maximum likelihood envelope may be initialized at
$-\infty$. Initializing the envelope is depicted by steps 5054
and 5056.

After the maximum likelihood envelope is initialized,
it is reduced by a predefined amount $\Delta$ to form a $\Delta$-good
region above the reduced likelihoods and a $\Delta$-bad region
below the reduced likelihoods. The value of $\Delta$ controls
the breadth of the search. The larger $\Delta$ is, the larger
the number of word paths that are considered for possible
extension. When $\log_{10}$ is used for determining $L_t$, a
value of 2.0 for $\Delta$ provides satisfactory results. The
value of $\Delta$ is preferably, but not necessarily, uniform
along the length of label intervals.

If a word path has a likelihood at the boundary label
thereof which is in the $\Delta$-good region, the word path is
marked "good". Otherwise, the word path is marked "bad".

As shown in FIG. 21, a loop for up-dating the likelihood
envelope and for marking word paths as "good" (for possible
extension) or "bad" starts with the finding of the
longest unmarked word path (step 5058). If more than one
unmarked word path is in the stack corresponding to the
longest word path length, the word path having the
highest likelihood at the boundary label thereof is selected.
If a word path is found, it is marked as "good"
if the likelihood at the boundary label thereof lies
within the $\Delta$-good region or "bad" otherwise (step 5060).
If the word path is marked "bad", another unmarked live
path is found and marked (step 5062). If the word path
is marked "good", the likelihood envelope is up-dated
to include the likelihood values of the path marked
"good". That is, for each label interval, an up-dated
likelihood value is determined as the greater likelihood
value between (a) the present likelihood value in the
likelihood envelope and (b) the likelihood value associated
with the word path marked "good". This is illustrated
by steps 5064 and 5066. After the envelope is up-dated,
a longest best unmarked live word path is again found
(step 5058).

The loop is then repeated until no unmarked word paths
remain. At that time, the shortest word path marked
"good" is selected. If there is more than one "good" word
path having a shortest length, the one having the highest
likelihood at the boundary label thereof is selected
(step 5070). The selected shortest path is then subjected
to extension. That is, at least one likely follower
word is determined as indicated above by
preferably performing the fast match, language model,
detailed match, and language model procedure. For each
likely follower word, an extended word path is formed.
Specifically, an extended word path is formed by appending
a likely follower word on the end of the selected
shortest word path.
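
The marking loop and the selection of the path to be extended can be sketched as follows. This is a simplified illustration under stated assumptions: live paths are held in a dictionary of log10 likelihood vectors whose last entry is the value at the boundary label, path length is measured by the boundary label position, "within the good region" is taken to include the reduced envelope value itself, and the function and path names are hypothetical.

```python
NEG_INF = float("-inf")


def select_path_to_extend(live_paths, num_intervals, delta=2.0, envelope=None):
    """Mark live paths good or bad, update the envelope, pick the shortest good path."""
    env = list(envelope) if envelope is not None else [NEG_INF] * num_intervals
    marks = {}
    # Longest path first; ties go to the highest likelihood at the boundary label.
    order = sorted(live_paths,
                   key=lambda name: (-len(live_paths[name]), -live_paths[name][-1]))
    for name in order:
        vector = live_paths[name]
        boundary = len(vector) - 1
        if vector[boundary] >= env[boundary] - delta:
            marks[name] = "good"
            # Up-date the envelope with the likelihood values of the good path.
            for t, value in enumerate(vector):
                env[t] = max(env[t], value)
        else:
            marks[name] = "bad"
    good = [name for name in order if marks[name] == "good"]
    if not good:
        return None, marks, env
    # Shortest good path; ties go to the highest likelihood at the boundary label.
    chosen = min(good, key=lambda name: (len(live_paths[name]), -live_paths[name][-1]))
    return chosen, marks, env


# Hypothetical live word paths with log10 likelihood vectors.
paths = {
    "A": [-1.0, -2.0, -3.0, -4.0],
    "B": [-1.0, -3.0, -7.0],
    "C": [-0.5, -1.0],
}
chosen, marks, env = select_path_to_extend(paths, num_intervals=4)
```

In this toy run path B lies more than 2.0 below the envelope at its boundary label and is marked bad, while the shortest good path, C, is the one chosen for extension.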

After the selected shortest word path is formed into
extended word paths, the selected word path is removed
from the stack in which it was an entry and each extended
word path is entered into the appropriate stack therefor.
In particular, an extended word path becomes an
entry into the stack corresponding to the boundary label
of the extended word path (step 5072).

With regard to step 5072, the action of extending the 
chosen path is discussed with reference to the flowchart 
of FIG. 22. After the path is found in step 5070, the 
following procedure is performed whereby a word path or 
paths are extended based on an appropriate approximate 
match. 

At step 6000, the acoustic processor 1002 (of FIG. 1)
generates a string of labels as described hereinabove.
The string of labels is provided as input to enable step
6002 to be performed. In step 6002 the basic or one of
the enhanced approximate matching procedures is performed
to obtain an ordered list of candidate words according
to the teachings outlined hereinabove.
Thereafter, a language model (as described hereinabove)
is applied in step 6004. The
subject words remaining after the language model is applied
are entered together with the generated labels in
a detailed match processor which performs step 6006. The
detailed match results in a list of remaining candidate
words which are preferably subjected to the language
model in step 6008. The likely words --as determined by
the approximate match, detailed match, and language
model-- are used for extension of the path found in step
5070 of FIG. 21. Each of the likely words determined at
step 6008 (FIG. 22) is separately appended to the found
word path so that a plurality of extended word paths may
be formed.

Referring again to FIG. 21, after the extended paths are
formed and the stacks are re-formed, the process repeats 
by returning to step 5052. 

Each iteration thus consists of selecting the shortest 
best "good" word path and extending it. A word path 
marked "bad" on one iteration may become "good" on a 
later iteration. The characterization of a live word 
path as "good" or "bad" is thus made independently on 
each iteration. In practice, the likelihood envelope 
does not change greatly from one iteration to the next 
and the computation to decide whether a word path is 
"good" or "bad" is done efficiently. Moreover, normal- 
ization is not required. 

When complete sentences are identified, step 5074 is
preferably included. That is, when no live word paths
remain unmarked and there are no "good" word paths to
be extended, decoding is finished. The complete word
path having the highest likelihood at the respective
boundary label thereof is identified as the most likely
word sequence for the input label string.

In the case of continuous speech where sentence endings
are not identified, path extension proceeds continually
or for a predefined number of words as preferred by the
system user.

K. Constructing Phonetic Baseforms 

One type of Markov model phone machine which can be used 
in forming baseforms is based on phonetics. That is, 
each phone machine corresponds to a given phonetic 
sound, such as those included in the International Pho- 
netic Alphabet. 

For a given word, there is a sequence of phonetic sounds
each having a respective phone machine corresponding
thereto. Each phone machine includes a number of states
and a number of transitions between states, some of
which can produce a feneme output and some (referred to
as null transitions) which cannot. Statistics relating
to each phone machine --as noted hereinabove-- include
(a) the probability of a given transition occurring and
(b) the likelihood of a particular feneme being produced
at a given transition. Preferably, at each non-null
transition there is some probability associated with
each feneme. In a feneme alphabet shown in Table 1, there
are preferably 200 fenemes. A phone machine used in
forming phonetic baseforms is illustrated in FIG. 3. A
sequence of such phone machines is provided for each
word. The statistics, or probabilities, are entered into
the phone machines during a training phase in which
known words are uttered. Transition probabilities and
feneme probabilities in the various phonetic phone machines
are determined during training by noting the
feneme string(s) generated when a known phonetic sound
is uttered at least once and by applying the well-known
forward-backward algorithm.

A sample of statistics for one phone identified as phone
DH are set forth in Table 2. As an approximation, the
label output probability distribution for transitions
tr1, tr2, and tr8 of the phone machine of FIG. 3 are represented
by a single distribution; transitions tr3,
tr4, tr5, and tr9 are represented by a single distribution;
and transitions tr6, tr7, and tr10 are represented
by a single distribution. This is shown in Table
2 by the assignment of arcs (i.e. transitions) to the
respective columns 4, 5, or 6. Table 2 shows the probability
of each transition and the probability of a label
(i.e. feneme) being generated in the beginning,
middle, or end, respectively, of the phone DH. For the
DH phone, for example, the probability of the transition
from state $S_1$ to state $S_2$ is counted as .07243. The
probability of transition from state $S_1$ to state $S_4$ is
.92757. (In that these are the only two possible transitions
from the initial state, their sum equals unity.)
As to label output probabilities, the DH phone has a .091
probability of producing the feneme AE13 (see Table 1)
at the end portion of the phone, i.e. column 6 of Table
2. Also in Table 2 there is a count associated with each
node (or state). The node count is indicative of the
number of times during the training that the phone was
in the corresponding state. Statistics as in Table 2 are
found for each phoneme machine.

The arranging of phonetic phone machines into a word
baseform sequence is typically performed by a
phonetician and is normally not done automatically.

L. Constructing Fenemic Baseforms

FIG. 23 shows an embodiment of a fenemic phone. The
fenemic phone has two states and three transitions. A
null transition is indicated with dashed lines and
represents a path from state 1 to state 2 in which no
label can be produced. A self-loop transition at state
1 permits any number of labels to be produced thereat.
A non-null transition between state 1 and state 2 is
permitted to have a label produced thereat. The probabilities
associated with each transition and with each
label at a transition are determined during a training
session in a manner similar to that previously described
with reference to phonetic-type baseforms.
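
The behavior of such a two-state, three-transition phone can be illustrated by computing the probability that it produces a given label string while passing from state 1 to state 2. The sketch below is illustrative only: the function name, the transition probabilities, and the label output probabilities are assumed values, and the label "TH1" is hypothetical.

```python
def fenemic_phone_probability(labels, p_loop, p_next, p_null, out_loop, out_next):
    """Probability that the phone produces `labels` between state 1 and state 2.

    p_loop: self-loop at state 1 (produces one label each time it is taken)
    p_next: non-null transition from state 1 to state 2 (produces one label)
    p_null: null transition from state 1 to state 2 (produces no label)
    out_loop, out_next: label output probabilities for the two non-null transitions
    """
    # stay[k]: probability of still being in state 1 after producing k labels.
    stay = [1.0]
    for label in labels:
        stay.append(stay[-1] * p_loop * out_loop.get(label, 0.0))
    # Every label produced on the self-loop, then exit through the null transition.
    total = stay[-1] * p_null
    if labels:
        # Or: the last label is produced on the non-null transition to state 2.
        total += stay[-2] * p_next * out_next.get(labels[-1], 0.0)
    return total


# Assumed statistics, for illustration only.
outputs = {"AE13": 0.5, "TH1": 0.5}
p_empty = fenemic_phone_probability([], 0.25, 0.5, 0.25, outputs, outputs)
p_one = fenemic_phone_probability(["AE13"], 0.25, 0.5, 0.25, outputs, outputs)
p_two = fenemic_phone_probability(["AE13", "AE13"], 0.25, 0.5, 0.25, outputs, outputs)
```

The empty string is possible only through the null transition, while longer strings require repeated use of the self-loop, so their probability falls off with length.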

Fenemic word baseforms are constructed by concatenating
fenemic phones. One approach is described in the co-pending
application entitled "Feneme-based Markov Models
for Words". Preferably, the fenemic word baseforms are
grown from multiple utterances of the corresponding
word. This is described in a co-pending and commonly
assigned application entitled "Constructing Markov Models
of Words from Multiple Utterances", Canadian Patent
Application No. 504,801, filed March 24, 1986.






Briefly, one method of growing baseforms from multiple
utterances includes the steps of:

(a) transforming multiple utterances of the
word segment into respective strings of
fenemes;

(b) defining a set of fenemic Markov model
phone machines;

(c) determining the best single phone machine
for producing the multiple feneme strings;

(d) determining the best two phone baseform
of the form $P_1P_2$ or $P_2P_1$ for producing the
multiple feneme strings;

(e) aligning the best two phone baseform against
each feneme string;

(f) splitting each feneme string into a left
portion and a right portion with the left
portion corresponding to the first phone machine
of the two phone baseform and the right
portion corresponding to the second phone machine
of the two phone baseform;

(g) identifying each left portion as a left
substring and each right portion as a right
substring;

(h) processing the set of left substrings in
the same manner as the set of feneme strings
corresponding to the multiple utterances, including
the further step of inhibiting further
splitting of a substring when the single phone
baseform thereof has a higher probability of
producing the substring than does the best two
phone baseform;

(j) processing the set of right substrings in
the same manner as the set of feneme strings
corresponding to the multiple utterances, including
the further step of inhibiting further
splitting of a substring when the single phone
baseform thereof has a higher probability of
producing the substring than does the best two
phone baseform; and

(k) concatenating the unsplit single phones
in an order corresponding to the order of the
feneme substrings to which they correspond.

The number of model elements is typically approximately
the number of fenemes obtained for an utterance of the
word. The baseform models are then trained (or filled
with statistics) by speaking known utterances into
an acoustic processor that generates a string of labels
in response thereto. Based on the known utterances and
the generated labels, the statistics for the word models
are derived according to the well-known forward-backward
algorithm discussed in articles referred to hereinabove.

In FIG. 24 a lattice corresponding to fenemic phones is 
illustrated. The lattice is significantly simpler than 
the lattice of FIG. 11 relating to a phonetic detailed 
match. As noted hereinabove, a fenemic detailed match 
proceeding one time interval at a time through the 
lattice of FIG. 24 is set forth in Appendix 1. 

(II) Selecting Likely Words From a Vocabulary By Means 
of Polling 

Referring to FIG. 25, a flowchart 8000 of one embodiment
of the present invention is illustrated. As depicted in
FIG. 25, a vocabulary of words is initially prescribed
in step 8002. The words may relate to standard office
correspondence words or technical words, depending on
the user. The vocabulary has included on the order of
5000 words or more, although the number of words
may vary.

Each word is represented by a sequence of Markov model
phone machines in accordance with the teachings of
Section (I)(K) or (I)(L). That is, each word may be
represented as a constructed baseform of sequential
phonetic phone machines or a constructed baseform of
sequential fenemic phone machines.

A "vote" is then determined for each label for each word
in step 8006. The vote determining step 8006 is
described with reference to FIGS. 25, 26, 27, 28, and 29.

FIG. 26 shows a graph of the distribution of acoustic
labels for a given phone machine. The counts indicated
are extracted from the statistics generated during
training. During training, it is recalled, known
utterances corresponding to known phone sequences are spoken
and a string of labels is generated in response thereto.
The number of times each label is produced when a known
phone is uttered is thus provided during the session.
For each phone, a distribution as in FIG. 26 is
generated.

In addition to extracting the information included in
FIG. 26 from the training data, the expected number of
labels for a given phone is also derived from the
training data. That is, when a known utterance
corresponding to a given phone is spoken, the number of labels
generated is noted. Each time the known utterance is
spoken, a number of labels for the given phone is noted.
From this information, the most likely or expected
number of labels for the given phone is defined. FIG. 27 is
a graph showing the expected number of labels for each
phone. If the phones correspond to fenemic phones, the
expected number of labels for each phone should
typically average about one. For phonetic phones, the number
of labels may vary greatly.

The extraction of information in the graphs from
training data is achieved by using information from the
forward-backward algorithm described in detail in
Appendix II of the article "Continuous Speech Recognition
by Statistical Methods". Briefly, the forward-backward
algorithm involves determining the probability of each
transition between a state i and a state (i+1) in a phone
by (a) looking forward from the initial state of a Markov
model to the state i and determining the statistics of
getting to the state i in a "forward pass" and (b)
looking backward from the final state of a Markov model
to the state (i+1) and determining the statistics of
getting to the final state from state (i+1) in a
"backward pass". The probability of the transition from state
i to state (i+1), given state i, and the label outputs
thereat are combined with the other statistics in
determining the probability of a subject transition occurring,
given a certain string of labels. Because Appendix II
of the above-noted article sets forth in detail the
mathematics and application of the algorithm, further
description is not provided.

Each word is known to be a predefined sequence of phones
as illustrated in FIG. 28 with reference to a WORD 1 and
a WORD 2. Given the phone sequence for each word and the
information discussed relative to FIGS. 25 and 26, a
determination can be made as to how many times a given
label is most likely to occur for a particular subject
word W. For a word such as WORD 1, the number of times
label 1 is expected may be computed as the number of
counts of label 1 for the first phone of the word, plus
the number of counts of label 1 for the second phone,
plus the number of counts of label 1 for the third
phone, and so on. Similarly, for the word WORD 1, the
number of times label 2 is expected may be computed as
the number of counts of label 2 for the first phone,
plus the number of counts of label 2 for the second
phone, and so on. The expected count of each label for
WORD 1 is evaluated by performing the above steps for
each of the two hundred labels. A count for a particular
label may represent the number of occurrences of the
particular label divided by the total number of labels
generated during the training phase or may,
alternatively, correspond to just the number of
occurrences of the label during training.
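
By way of illustration only, the summation of per-phone label
counts over the phones of a word may be sketched in Python as
follows. The phone names, the counts, and the function name
expected_label_counts are hypothetical and are not taken from
the training data of the described system.

```python
from collections import Counter


def expected_label_counts(phone_sequence, phone_label_counts):
    """Sum the per-phone label counts over every phone of a word baseform."""
    totals = Counter()
    for phone in phone_sequence:
        # Counter.update adds the counts of each label to the running total.
        totals.update(phone_label_counts[phone])
    return dict(totals)


# Hypothetical training statistics: phone -> {label: count}.
phone_label_counts = {
    "P1": {1: 8.0, 2: 1.0},
    "P2": {1: 2.0, 3: 6.0},
    "P3": {2: 4.0, 3: 1.0},
}

# Hypothetical baseform of WORD 1 as a sequence of phones.
word_1 = ["P1", "P2", "P3"]
counts_word_1 = expected_label_counts(word_1, phone_label_counts)
```

A phone that occurs more than once in a baseform contributes its
counts once per occurrence, which is consistent with summing over
the phone sequence rather than over the set of distinct phones.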

In FIG. 29 the expected count for each label in a par- 
ticular word (e.g. WORD 1) is set forth. 

From the expected label counts shown in FIG. 29 for a
given word, a "vote" of each label for the given word
is evaluated. The vote of a label L' for a given word
W' represents the likelihood of the word W' producing
the label L'. The vote preferably corresponds to the
logarithmic probability of word W' producing L'.
Preferably, the vote is determined by

$$\text{vote} = \log_{10}\{\Pr(L' \mid W')\}$$

The votes are stored in a table as shown in FIG. 30. For
each word 1 through W, each label has an associated vote
identified as V with a double subscript. The first
element in the subscript corresponds to the label and the
second to the word. V_{12} is therefore the vote of label
1 for word 2.
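
One possible way of filling such a vote table is sketched below
in Python. The sketch assumes, for illustration, that the
probability of a word producing a label is estimated as the
expected count of the label for the word divided by the sum of
the expected counts over all labels for that word, and that a
small floor value stands in for labels having a zero count.
Neither assumption, nor any of the names used, is prescribed by
the present description.

```python
import math


def vote_table(expected_counts, num_labels, floor=1e-6):
    """Return table[word][label] = log10 of the estimated Pr(label | word)."""
    table = {}
    for word, counts in expected_counts.items():
        total = sum(counts.get(label, 0.0) for label in range(1, num_labels + 1))
        votes = {}
        for label in range(1, num_labels + 1):
            # Assumed estimate: normalized expected count, floored to avoid log(0).
            prob = max(counts.get(label, 0.0) / total, floor)
            votes[label] = math.log10(prob)
        table[word] = votes
    return table


# Hypothetical expected label counts for one word over a 4-label alphabet.
example_counts = {"WORD1": {1: 10.0, 2: 5.0, 3: 5.0}}
example_votes = vote_table(example_counts, num_labels=4)
```

With a 200-label alphabet and a vocabulary of several thousand
words the table is large but is computed only once, before any
unknown input is processed.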

Referring again to FIG. 25, the process of selecting
likely candidate words from a vocabulary through polling
is shown to include a step 8008 of generating labels in
response to an unknown spoken input. This is performed
by the acoustic processor 1004 (of FIG. 1).

The generated labels are looked up in the table of FIG. 30
for a subject word. The vote of each generated label for
the subject word is retrieved. The votes are then
accumulated to give a total vote for the subject word (step
8010). By way of example, if labels 1, 3, and 5 are
generated, votes V_{11}, V_{31} and V_{51} would be evaluated and
combined. If the votes are logarithmic probabilities,
they are summed to provide the total vote for word 1. A
similar procedure is followed for each of the words
in the vocabulary, so that labels 1, 3, and 5 vote for
each word.

In accordance with one embodiment of the invention, the
accumulated vote for each word serves as the likelihood
score for the word. The n words (where n is a predefined
integer) having the highest accumulated vote are defined
as candidate words that are to be later processed in a
detailed match and language model as outlined above.
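
The vote-only embodiment may be illustrated by the following
Python sketch, in which the votes of the generated labels are
summed for each word and the n best words are retained. The
word names and vote values are hypothetical, and the functions
total_vote and top_candidates are introduced here only for
illustration.

```python
import heapq


def total_vote(labels, votes_for_word):
    """Sum the (logarithmic) votes of every generated label for one word."""
    return sum(votes_for_word[label] for label in labels)


def top_candidates(labels, votes, n):
    """Return the n words having the highest accumulated vote, best first."""
    scores = {word: total_vote(labels, v) for word, v in votes.items()}
    return heapq.nlargest(n, scores, key=scores.get)


# Hypothetical vote table: word -> {label: log-probability vote}.
votes = {
    "w1": {1: -0.3, 3: -0.5, 5: -1.0},
    "w2": {1: -2.0, 3: -2.0, 5: -2.0},
    "w3": {1: -0.1, 3: -0.2, 5: -0.3},
}
candidates = top_candidates([1, 3, 5], votes, n=2)
```

A label generated several times in the string votes once per
occurrence in this sketch.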

In another embodiment, word "penalties" are evaluated
as well as votes. That is, for each word, a penalty is
determined and assigned (step 8012). The penalty
represents the likelihood that a subject label is not
produced by a given word. There are various methods for
determining the penalty. One approach for determining
the penalty for a word represented by a fenemic baseform
involves assuming that each fenemic phone produces only
one label. For a given label and a subject fenemic phone,
the penalty for the given label corresponds to the log
probability of any other label being produced by the
subject fenemic phone. The penalty of label 1 for phone
P2 then corresponds to the log probability of any label
2 through 200 being the one produced label. The
assumption of one label output per fenemic phone, although not
a correct one, has proved satisfactory in evaluating
penalties. Once a penalty of a label for each phone is
determined, the penalty for a word constituting a
sequence of known phones is readily determined.
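
The one-label-per-fenemic-phone approach may be sketched in
Python as follows. The sketch assumes, as an illustration that
is not stated in the present description, that the penalty of a
label for a word is the sum of the penalties of that label over
the phones of the word. The phone names, probabilities, floor
value, and function names are likewise hypothetical.

```python
import math


def phone_penalty(label, label_probs, floor=1e-6):
    """log10 probability that the phone's single output is any other label."""
    p_other = 1.0 - label_probs.get(label, 0.0)
    # Floor avoids log(0) when a phone always produces the given label.
    return math.log10(max(p_other, floor))


def word_penalty(label, phone_sequence, phone_label_probs):
    """Assumed combination: sum the phone penalties along the baseform."""
    return sum(phone_penalty(label, phone_label_probs[p]) for p in phone_sequence)


# Hypothetical label output probabilities for two fenemic phones.
phone_label_probs = {
    "F1": {1: 0.9, 2: 0.1},
    "F2": {1: 0.5, 2: 0.5},
}
penalty_label_1 = word_penalty(1, ["F1", "F2"], phone_label_probs)
penalty_label_3 = word_penalty(3, ["F1", "F2"], phone_label_probs)
```

A label that no phone of the word is likely to produce receives
a penalty near zero, whereas a label that the word is very likely
to produce receives a strongly negative penalty when it is absent.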

The penalty of each label for each word is shown in 
FIG. 31. Each penalty is identified as PEN followed by 
two subscripts, the first representing the label and the 
second representing the word. 

Referring again to FIG. 25, the labels generated in step
8008 are examined to see which labels of the label
alphabet have not been generated. The penalty for each
label not generated is evaluated for each word. To
obtain the total penalty for a given word, the penalty of
each ungenerated label for the given word is retrieved
and all such penalties are accumulated (step 8014). If
each penalty corresponds to a logarithmic "null"
probability, the penalties for the given word are summed over
all such labels, as were the votes. The above procedure is
repeated for each word in the vocabulary so that each
word has a total vote and a total penalty given the
string of generated labels.

When a total vote and a total penalty have been derived
for each vocabulary word, a likelihood score is defined
by combining the two values (step 8016). If desired, the
total vote may be weighted to be greater than the total
penalty, or vice versa.

Moreover, the likelihood score for each word is
preferably scaled based on the number of labels
that are voting (step 8018). Specifically, after the
total vote and the total penalty, both of which
represent sums of logarithmic probabilities, are added
together, the final sum is divided by the number of
speech labels generated and considered in computing the
votes and penalties. The result is a scaled likelihood
score.
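
The combination and scaling of steps 8016 and 8018 may be
illustrated by the following Python sketch for a single word.
The weights vote_weight and penalty_weight, the function name,
and the numerical values are hypothetical; the sketch simply
adds the weighted total vote and total penalty and divides by
the number of generated labels.

```python
def scaled_likelihood(labels, alphabet, votes, penalties,
                      vote_weight=1.0, penalty_weight=1.0):
    """Scaled likelihood score of one word for a string of generated labels."""
    if not labels:
        raise ValueError("at least one generated label is required")
    generated = set(labels)
    # Every label in the string votes, including repeated labels.
    total_vote = sum(votes[label] for label in labels)
    # Every label of the alphabet that was not generated is penalized.
    total_penalty = sum(penalties[label] for label in alphabet
                        if label not in generated)
    combined = vote_weight * total_vote + penalty_weight * total_penalty
    return combined / len(labels)


# Hypothetical votes and penalties of one word over a 4-label alphabet.
alphabet = [1, 2, 3, 4]
word_votes = {1: -0.5, 2: -1.0, 3: -2.0, 4: -3.0}
word_penalties = {1: -1.0, 2: -0.4, 3: -0.2, 4: -0.1}
score = scaled_likelihood([1, 1, 2], alphabet, word_votes, word_penalties)
```

Dividing by the number of generated labels makes scores computed
over label strings of different lengths comparable, which matters
when the score is re-evaluated at successive time intervals.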

A further aspect of the invention relates to determining
which labels in a string to consider in the voting and
penalty (i.e. polling) computations. Where the end of a
word is identified and the labels corresponding thereto
are known, preferably all labels generated between a
known start time and the known end time are considered.
However, when the end time is found to be not known (in
step 8020), the present invention provides the following
methodology. A reference end time is defined and a
likelihood score is evaluated repeatedly after the
reference end time at successive time intervals (step
8022). For example, after 500ms the (scaled) likelihood
score for each word is evaluated at 50ms intervals
thereafter up to 1000ms from the start time of the word
utterance. In the example, each word will have ten
(scaled) likelihood scores.

A conservative approach is adopted in selecting which
of the ten likelihood scores should be assigned to a
given word. Specifically, for the series of likelihood
scores obtained for a given word, the likelihood score
that is highest relative to the likelihood scores of
other words obtained at the same time interval is
selected (step 8024). To this end, the highest likelihood
score at each time interval is subtracted from all
likelihood scores at that time interval. In this regard,
the word having the highest likelihood score at a given
interval is set at zero, with the likelihood scores of
the other less likely words having negative values. The
least negative likelihood score for a given word is
assigned thereto as the highest relative likelihood
score for the word.
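
The selection of step 8024 may be sketched in Python as follows,
assuming that the scaled likelihood scores have already been
computed for every word at every time interval. The data layout
(one dictionary per interval), the word names, and the function
name are hypothetical.

```python
def assign_relative_scores(scores_by_interval):
    """Assign each word its least negative relative score over all intervals.

    scores_by_interval is a list with one {word: scaled score} dictionary
    per time interval.
    """
    assigned = {}
    for scores in scores_by_interval:
        best = max(scores.values())
        for word, score in scores.items():
            # Zero for the best word of the interval, negative otherwise.
            relative = score - best
            if word not in assigned or relative > assigned[word]:
                assigned[word] = relative
    return assigned


# Hypothetical scaled scores of three words at two time intervals.
intervals = [
    {"a": -1.0, "b": -2.0, "c": -4.0},
    {"a": -3.0, "b": -2.5, "c": -3.0},
]
relative_scores = assign_relative_scores(intervals)
```

In this example words "a" and "b" are each best at one interval
and are assigned zero, while word "c" is assigned its closest
approach to the best word.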

When likelihood scores are assigned to each word, the n
words having the highest assigned likelihood scores are
selected as candidate words resulting from polling (step
8026).

In one embodiment of the invention, the n words
resulting from the polling are provided as a reduced list of
words that are then subjected to processing according
to the detailed match and language model described
above. The reduced list obtained by polling, in this
embodiment, acts in place of the acoustic fast match
outlined above. In this regard, it is observed that the
acoustic fast match provides a tree-like lattice
structure in which word baseforms are entered as sequential
phones, wherein words having the same initial phones
follow a common branch along the tree structure. For a
2000 word vocabulary, the polling method has been found
to be two to three times faster than the fast acoustic
match which includes the tree-like lattice structure.

Alternatively, however, the acoustic fast match and the
polling may be used in conjunction. That is, from the
trained Markov models and the generated string of
labels, the approximate fast acoustic match is performed
in step 8028 in parallel with the polling. One list is
provided by the acoustic match and one list is provided
by the polling. In a conservative approach, the entries
on one list are used in augmenting the other list. In
an approach which seeks to further reduce the number of
best candidate words, only words appearing in both lists
are retained for further processing. The interaction of
the two techniques in step 8030 depends on the accuracy
and computational goals of the system. As yet another
alternative, the lattice-style acoustic fast match may
be applied to the polling list sequentially.
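
The two ways of combining the lists in step 8030 may be
illustrated by the following Python sketch. The function names
augment and intersect and the example words are hypothetical;
the first function corresponds to the conservative approach and
the second to the approach that further reduces the list.

```python
def augment(polling_list, fast_match_list):
    """Conservative approach: add fast match entries missing from polling."""
    merged = list(polling_list)
    seen = set(polling_list)
    for word in fast_match_list:
        if word not in seen:
            merged.append(word)
            seen.add(word)
    return merged


def intersect(polling_list, fast_match_list):
    """Reducing approach: keep only the words appearing in both lists."""
    keep = set(fast_match_list)
    return [word for word in polling_list if word in keep]


# Hypothetical candidate lists from polling and from the fast match.
polling = ["there", "their", "three"]
fast_match = ["their", "the", "three"]
augmented = augment(polling, fast_match)
reduced = intersect(polling, fast_match)
```

The augmented list trades additional detailed match computation
for a lower chance of losing the correct word, while the reduced
list makes the opposite trade.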

Apparatus 8100 for performing the polling is set forth
in FIG. 32. Element 8102 stores word models that have
been trained as discussed hereinabove. From the
statistics applied to the word models, a vote generator 8104
evaluates the vote of each label for each word and stores
the votes in vote table storage 8106.

Similarly, a penalty generator 8108 evaluates penalties
of each label for each word in the vocabulary and enters
the values into penalty table storage 8110.

A word likelihood score evaluator 8112 receives labels
generated by an acoustic processor 8114 in response to
an unknown spoken input. For a given word selected by
a word select element 8116, the word likelihood score
evaluator 8112 combines the votes of each generated
label for the selected word together with the penalties
of each label not generated. The likelihood score
evaluator 8112 includes means for scaling the likelihood
score as discussed hereinabove. The likelihood score
evaluator may also, but need not, include means for
repeating the score evaluation at successive time
intervals following a reference time.

The likelihood score evaluator 8112 provides word scores
to a word lister 8120 which orders the words according
to assigned likelihood score.

In an embodiment which combines the word list derived
by polling with the list derived by approximate acoustic
matching, a list comparer 8122 is provided. The list
comparer receives as input the polling list from the
word lister 8120 and the list from the acoustic fast
match (which is described in several embodiments
hereinabove).

To reduce storage and computational requirements,
several features may be included. First, the votes and
penalties can be formatted as integers ranging between
0 and 255. Second, the actual penalties can be replaced
by approximate penalties computed from the corresponding
vote as PEN = a*vote + b, where a and b are constants that
can be determined by least squares regression. Third, labels
can be grouped together into speech classes where each
class includes at least one label. The allocation of
labels to classes can be determined by hierarchically
clustering the labels so as to maximize the resulting
mutual information between speech classes and words.
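
The first two features may be sketched in Python as follows.
The linear mapping used for the integer format is only one
possible choice and is an assumption of this sketch, as are the
function names and the example values; the regression is an
ordinary least squares fit of the penalties against the votes.

```python
def quantize(values, lo, hi):
    """Map values in [lo, hi] linearly onto integers 0..255 (clamped)."""
    scale = 255.0 / (hi - lo)
    return [min(255, max(0, round((v - lo) * scale))) for v in values]


def fit_penalty_line(votes, penalties):
    """Least squares constants a, b such that PEN is approximately a*vote + b."""
    n = len(votes)
    mean_v = sum(votes) / n
    mean_p = sum(penalties) / n
    sxx = sum((v - mean_v) ** 2 for v in votes)
    sxy = sum((v - mean_v) * (p - mean_p) for v, p in zip(votes, penalties))
    a = sxy / sxx
    b = mean_p - a * mean_v
    return a, b


# Hypothetical votes and the corresponding actual penalties.
sample_votes = [-4.0, -3.0, -2.0, -1.0]
sample_penalties = [-0.1, -0.2, -0.3, -0.4]
a, b = fit_penalty_line(sample_votes, sample_penalties)
```

Storing only the vote table together with the two constants
removes the need for a separate penalty table, at the cost of
approximating each penalty.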

It should be further noted that, in accordance with the
invention, periods of silence are detected (by known
methods) and are ignored.

The present invention has been implemented in PL/I on 
an IBM MVS System, but may be implemented in other pro- 
gramming languages and on other systems given the 
teachings set forth herein. 

While the invention has been described with reference 
to a preferred embodiment thereof, it will be understood 
by those skilled in the art that various changes in form 
and details may be made without departing from the scope 
of the invention. 

For example, the end time for a word may alternatively
be defined by summing the number of expected labels for
each phone in the word baseform. In addition, the votes
and penalties may be determined for selected generated
labels, such as every odd label or just the first m
labels in the string, although it is preferred that
votes and penalties for each label between the start and
end of a word be considered.

Moreover, the present invention contemplates the use of
various voting formulas other than the summing of
logarithmic probabilities. The invention generally applies
to a polling apparatus and a polling method for
obtaining a short list of candidate words wherein each label
casts a vote for each word in the vocabulary and wherein
this vote typically varies from word to word.

TABLE 1

THE TWO LETTERS ROUGHLY REPRESENT THE SOUND OF THE
ELEMENT.

TWO DIGITS ARE ASSOCIATED WITH VOWELS:
    FIRST:  STRESS OF SOUND
    SECOND: CURRENT IDENTIFICATION NUMBER
ONE DIGIT ONLY IS ASSOCIATED WITH CONSONANTS:
    SINGLE DIGIT: CURRENT IDENTIFICATION NUMBER

001 AA11    029 BX2-    057 EH02    148 TX5-    176 XX11
002 AA12    030 BX3-    058 EH11    149 TX6-    177 XX12
003 AA13    031 BX4-    059 EH12    150 UH01    178 XX13
004 AA14    032 BX5-    060 EH13    151 UH02    179 XX14
005 AA15    033 BX6-    061 EH14    152 UH11    180 XX15
006 AE11    034 BX7-    062 EH15    153 UH12    181 XX16
007 AE12    035 BX8-    126 RX1-    154 UH13    182 XX17
008 AE13    036 BX9-    127 SH1-    155 UH14    183 XX18
009 AE14    037 DH1-    128 SH2-    156 UU11    184 XX19
010 AE15    038 DH2-    129 SX1-    157 UU12    185 XX2-
011 AW11    039 DQ1-    130 SX2-    158 UXG1    186 XX20
012 AW12    040 DQ2-    131 SX3-    159 UXG2    187 XX21
013 AW13    041 DQ3-    132 SX4-    160 UX11    188 XX22
014 AX11    042 DQ4-    133 SX5-    161 UX12    189 XX23
015 AX12    043 DX1-    134 SX6-    162 UX13    190 XX24
016 AX13    044 DX2-    135 SX7-    163 VX1-    191
017 AX14    045 EE01    136 TH1-    164 VX2-    192 XX4-
018 AX15    046 EE02    137 TH2-    165 VX3-    193 XX5-
019 AX16    047 EE11    138 TH3-    166 VX4-    194 XX6-
020 AX17    048 EE12    139 TH4-    167 WX1-    195 XX7-
021 BQ1-    049 EE13    140 TH5-    168 WX2-    196 XX8-
022 BQ2-    050 EE14    141 TQ1-    169 WX3-    197 XX9-
023 BQ3-    051 EE15    142 TQ2-    170 WX4-    198 ZX1-
024 BQ4-    052 EE16    143 TQ3-    171 WX5-    199 ZX2-
025 BX1-    053 EE17    144 TX1-    172 WX6-    200 ZX3-
026 BX10    054 EE18    145 TX2-    173 WX7-
027 BX11    055 EE19    146 TX3-    174 XX1-
028 BX12    056 EH01    147 TX4-    175 XX10

PHONE

U3£L 
COUMT 



OH 
1 

8 

31.0 



7 NOOES. 13 ARCS. 



2 



.J 

1.7 



It 

119,1 



TABLE 2 
3 ARC U3ELS. 



12 
MS.'* 



6 

t2oII 



7 
0 

0.0 



ARC 

LA3EL 

PROS 

ARC 

LA3eL 

PR03 



1 -> 2 

0.072*13 



t -> A 
0.92757 



0.25611 0,75370 



1 -> 7 

NULL 
O.OOOOO 

I 

0,2li63O 



2.> I 

0.99259 



2 -> 7 

NULL 
0.007tl1 



0.93982 



3 7 
NULL 
0.06013 



0.7S179 



i» -> 5 5 -> 
0.2!4321 0.7ti33 



LA3£L 
! COUNT 

A£13 
' BXIO 

i 

i 8X3- 

' onr 

0(12" 
EHOT 

) EH02 
EHM 

i EHl) 
£Ml4 
£R1^ 

r:<2 

GX2- 

, Gxs: 



k 

120.8 



030 
130 
Oil 
020 
01 1 
010 



0.1^8 



1t|6. 



0.086 
0.040 
0.052 
O.OI't 



o.oti; 

O.ltiS 
0.013 



6 

121.6 
0.091 



0.013 

0. 167 

0.026 
0.015 
0.012 
0.062 
0*024 



CX6 

Hxr 

1X05 

ixt3 

XQI 

:<X2- 

f1X2" 
MX3" 

:iX5" 

HX6- 

CU1^4 

?v 

TH2- 

TQ3- 

UHOT 

UH02 

UXG2 

UX12 

UXI3 

VXI 

VX3" 

WX2- 

.XX2J 

OTHER 



0.2ti6 

o.on 

0.025 



0.029 
0.019 
0.0^9 



0.029 



0.025 



0.0(1 1 
0.023 
0.072 

0.073 



0.023 
0.011 



O.Oli^ 
0,013 
0.043 



0.017 

0.018 
0.020 
0.017 

0.032 



0.23? 

o.oiQ 

0.047 



0.020 
0.026 
0.024 

0.012 



0.012 
0.023 



0.020 
0. 109 
0.016 
0.062 
0. 181 
0.016 
0.016 



0.048

""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
"
" Subroutine EXTCOL.
"
" Extends the column
"
" Register Usage:
"
"    SP:  0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
"    DPX: -4, -3, -2, -1, 0 (DPA = 0)
"    DPY: -4, -3, -2, -1, 0, 1, 2, 3 (DPA = 0)
"
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Restore the registers from main memory.
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

EXTCOL: LDMA; DB = SFENPTR; WRTLMN      "get feneme list pointer
        LDMA; DB = STLPPTR; WRTLMN      "get tail prob list pointer
        LDMA; DB = SSTROFF; WRTLMN      "get start offset

        LDSPI FENPTR; DB < MD           "save feneme list pointer
        LDSPI TLPPTR; DB < MD           "save tail prob list pointer
        LDSPI STROFF; DB < MD           "save start offset

        LDMA; DB = SSTRROW; WRTLMN      "get start row
        LDMA; DB = SOUTOFF; WRTLMN      "get output offset
        LDMA; DB = SINBNDY; WRTLMN      "get input boundary pointer

        LDSPI STRROW; DB < MD           "save start row
        LDSPI OUTOFF; DB < MD           "save output offset
        LDSPI INBNDY; DB < MD           "save input boundary pointer

        LDMA; DB = STIME; WRTLMN        "get the current time
        LDMA; DB = SSTRLEN; WRTLMN      "get start distribution length
        LDMA; DB = ALEXLEX; WRTLMN      "get lexeme length

        LDSPI TIME; DB < MD             "save current time
        LDSPI STRLEN; DB < MD           "save start distribution length
        LDSPI LOOPCNT; DB < MD          "save lexeme length

        SUB STRROW, LOOPCNT             "loop count = lexlen - strrow

        LDSPI PRMADR; DB = SFENPTR      "put back feneme pointer + 1
        INC FENPTR; DPX(-3) < SPFN
        MOV PRMADR, PRMADR; SETMA; MI < DPX(-3)

        LDSPI PRMADR; DB = STLPPTR      "put back tail pointer + 1
        INC TLPPTR; DPX(-3) < SPFN
        MOV PRMADR, PRMADR; SETMA; MI < DPX(-3)

        LDSPI PRMADR; DB = SINBNDY      "put back inbndy + 1
        INC INBNDY; DPX(-3) < SPFN
        MOV PRMADR, PRMADR; SETMA; MI < DPX(-3)

        LDSPI PRMADR; DB = STIME        "put back time + 1
        INC TIME; DPX(-3) < SPFN
        MOV PRMADR, PRMADR; SETMA; MI < DPX(-3)

itnittttiiiftttttittitiifiinntitttttf iiiiitiititrfttitttiiiitiMitf If ttitiiiitiinnttittittMittiitiMttitMitiiniitnitiiitiititit 

" Get the next feneme and tall probability. " 

fiititifMnitninniiMtiiMtttiiitiitittirtntitttiiinitfiiiitniititiiititniiiitiititMiiitittititiuitititiiitiittiiitniMtiitii 



OUT; DB = PAGED 
MOV FENPTR, FENPTR; SETMA 
MOV TLPPTR, TLPPTR; SETMA 
OUT; DB = PAGEl 
LDSPI FENEME; DB < MD 
DPY(TPY) < MD 



MOV FENEME. FENEME; DPX(CFENX) < SPFN 



"flip to page zero 
"get next feneme 
"get next tail probability 
"flip back to page one 
"save the feneme 
"save the tall probability 



LDSPI FENLOK; DB = AFENLOK 

ADD FENEME, FENLOK; SETMA 

LDSPI INGOL; DB =» ACOLUMN 

LDSPI OUTCOL; DB = ACOLUMN 

LDSPI FENBAS; DB < MD 

ADD STROFF, INCOL 

ADD STRROW. INCOL 

ADD OUTOFF, OUTCOL 

ADD STRROW. OUTCOL 

LDSPI TRMLOK; DB = ATRMLOK 
ADD STRROW, TRMLOK; SETMA 
DEC OUTCOL 
DEC OUTCOL 
LDSPI TRMPTR; DB < MD 
LDSPI PRMADR; DB = !FFTSZ 
ADD PRMADR, PRMADR 
ADD PRMADR, TRMPTR 



"get base addr of feneme lookup 
"use feneme to get fenbas 
"get base addr for input col 
"get base addr for output col 
"pointer into feneme probs 

"incol = start_of fset + • • . 
" start_row 
"outcol = output^offset + ... 
" start Jrow - 2; 

"get base addr of tram lookup 

"lookup the tram base ptr 

"outcol. . . - 1 

"outcol, . . - 2 

"get .starting tram addr 

"compftnsate for tram base 

"which is at 2 X IFFT5Z 



ttitiiftMiitiiiitittiiiiiiitntiiiiMiititiiMitiiiiiniiMitiiiiiiiinMiiMiiiiiniiiitiiitiMitititiiiiiintniiuiiiiiM 

II 



INSIDE: 



MOV INCOL, INCOL; SETMA 
MOV INBNDY, INBNDY; -SETMA 
SUB/> TIME, STRLEN 
BLT OUTSIDE; 

DPY(LIY) < MD 

FADD DPY(LiY), MD 
FADD 

DPY(HY) <FA 



"get col(start_ptr) 

"get Inbndy(time) in case 

"start_length - time 

"if <= we are outside inbndy 

"and save col(start_ptr) 

"colO + inbndy(time) 
"push the adder 
"last^input =» col + inbndy 

"zero out the running profile 



OUTSIDE: DPY(PROFY) < ZERO 

niiiiiM!tniit«itiiitiMMiii!iirnniiiiiiitii"»Htiiinitiiii!tiiiiiiiii!tnttttiiiiitiiftiMiii?tiittiiiiiMi^ 

" Start the Dioeline eoing. • . start first loop 
liitiimimiiiiimiiiimitHiimiininmfn 



MOV TRMPTR, TRMPTR; SETTMA 
NOP 

LDSPI PHONE; DB < TM; 



"start transfer of phone num 
"wait for table memory 
"save the phone n\unber 
INCTMA

ADD FENBAS. PHONE; SETMA 

FADD DPY(LIY), ZERO; 
INCTMA 

FADD; 
INCTMA; 

DPY(SOFAY) < ZERO 



"increment tramptr 

"lookup feneme prob 

"push last_^state thru adder 
"start transfer of null trans 

"push the adder 

"start transfer of phone num 

"clear 2nd oldest fare 



iitiMiiiitiiiMftiiiitntttiMtiiitiiiniMtittiiiiiiiMfiTtttittitilitiittuMinntiitttiMintitiiiiiMnftiititinttitiiriti 

" Start the pipeline going... do first loop " 

iMtiiiitttttMMtitiitttitnittiifitMMfiiMtnttititiititiiittiiitinfttMniiiiiftiiitinitiiiiiitiiiittiiiMtintitfniti 



FMUL TM, FA; 

INC INCOL; SETMA; 
DPY (LIY) < FA; 
FADD DPY(PROFY), FA; 
DPX (FENX) < MD 

FMUL DPX(FENX), DPY(TPY); 
U)SPI PHONE; DB < TM; 
INCTMA; 
FADD 

Fim; 

DPY(PROFY) < FA; 

ADD FENBAS. PHONE; SETMA; 

DPX (FAX) < ZERO; 

FADD. 

FADD FM, MD; 

FMUL TM, DPX(FAX); 
INCTMA; 

DPX(OFAX) < DPY(SOFAY) 

FADD; 

FMUL FM, DPY(LIY); 
INCTMA; 

DPY(SOFAY) < DPX(FAX) 



"null trans * last_lnput 
"start transfer of col{) 
"save last input 
"profile = prof + last_input 



"feneme prob * tail prob 
"save the phone number 
"start trans of self trans 
"push the adder 



"savo the running profile 
"lookup f enema prob 
"zero out forward arc 
"null add 

"colO + null^'last_input 
"self trans * forward arc 
"start transfer of null trans 
"push queue of forward arcs 



"last_input * feneme^prob 
"start transfer of phone num 
"zero out 2nd oldest fare 



HlliitiiiiiiHiiiiiniiiMiniiiiiHtiiHtfiiiiiiiutniitiitMitiiniiiiitiiiiitiiMiiitiiititiinniiiiiiiMMniirM Mttiiiiiituift 

" LooD for index = 1 to loopcnt (= lexeme length - start row) " 

tiittiitiiimiiiiitiiiiiiniiiiiininiiinii!iiitiii«itiiM«!initu«MinnMiniiiniiHiiiiifMiiiiiiiiiiiTfiiM 



EXTLOOP: FMUL TM, FA; 

INC INCOL; SETMA; 
DPY (LIY) < FA; 
FADD DPY(PROFY), FA; 
DPX (FENX) < MD 

FMUL DPX(FENX), DPY(TPY); 
DPY(SELFY) < FM; 
LDSPI PHONE; DB' < TM; 
INCTMA; " 
FADD 



"null trans * last^input 
"start transfer of col() 
"save last input 
"profile ~ prof + last^input 



"feneme prob * tail prob 
"save self arc 
"save the phone number 
"start trans of self trans 
"push the adder 
FMUL;

DPY(PROFY) < FA; 

ADD FENBAS, PHONE; SETMA; 

DPX(FAX) < FM; 

FADD DPY(SELFY), DPX(OFAX) 

FADD FM, blD; 

FMUL TM, DPX(FAX); 

INCTMA; 

DEC LOOPCNT; 

DPX(OFAX) < DPY(SOFAY) 



"save the running profile 
"lookup feneme prob 
"save forward arc 
"self + oldest_farc 

"colO + nuU*last_input 
"self trans * forward arc 
"start transfer of null trans 
"decrement the loop count 
"push queue of forward arcs 



FADD; 

RIUL FM, DPY(LIY); 
INCTMA; 

INC OUTCOL; SETMA; 
MI < FA; 

DPY(SOFAY) < DPX(FAX); 
BGT EXTLOOP 



"last^input * feneme^prob 
"start transfer of phono num 
"put out ncxt_output 

"push forward arc queue 
"keep looping until done(BNE) 



iiirritittititiftittttfMtnfiMtttittiiiitttiiiiitMtttitittitniniiititiMitftittiitiMtittiititnitttiitttnitMtitittitiiiin 

" Trail out of loop*.. " 

MttititttfrnfitfiifitMniitMtiitttttitiiitintnitttiititttiiiiiitiitintftttttttiiitMMrttiiitnirttftiitttiiiiittiMiiitittiiiiii 



FMUL TM, FA; 
DPY (LIY) < FA; 
FADD DPY (PROF Y), FA 

FMUL; 

DPY(SELFY) < FM; 

INCTMA; 

FADD 



"calculate last forward arc 
."save last input 
"profile = prof + last_input 

"push the multiplier 
"save sale arc 
"start trans of self trans 
"push the adder 



FMUL; 

DPY(PROFY) < FA; 

DPX(FAX) < FM; 

FADD DPY(SELFY), DPX(OFAX) 

FADD; 

FMUL TM, DPX(FAX); 
DPX(OFAX) < DPY(SOFAY) 

FMUL; 

INC OUTCOL; SETMA; 
MI < FA; 

DPY(SOFAY) < DPX(FAX) 



"save the running profile 
"save forward arc 
"self + 61dest__farc 

"push the adder 

"self trans * forward arc 

"push queue of forward arcs 

"push the multiplier 
"put out next^output 

"push forward arc queue 



Mii!iiiiiMiititiitiii?MiiiiititiiiiiuitiiiitiiifnMMiiiiiiiifiifiiiin»iM""i"«»n"»»M"»"»"»"*»<»"»"""^ 

" Push last outputs out of loop... 

ifiiMttiniiiitMiiuiniifniiiiiitiiiiiiiiiiiiintiiniiitntMini"it«iiiiiniiiiininiiiiniiiiiMiitiii!tni"iti^ 



FMUL 

FADD FM, DPX(OFAX) 
FADD; 

bPX(OFAX) < DPY(SOFAY) 



"push the multiplier 

"self + oldcst_farc 

"push the adder 

"push queue of forward arcs 



INC OUTCOL; SETMA; 



put out next^output 
MI < FA;

DPy(SOFAY) < DPX(FAX) 

INC OUTCOL; SETMA; 
MI < DPX(OFAX) 



RETURN 
"push forward arc queue



"push out oldest forward arc 
"finished EXTCOL 
APPENDIX 2

IttnttlltltlttllltlllltltltltttttltllttMtltllMllftllttttttltlfMf itltltlltttinttltlltltlllttltltlllflllltf lltltlllllttltllllttttlllM 

It " 

" SUBROUTINE APFM " 

" This program implements the acoustic Fast Match in the FPS Array " 
" Processor. This is the modified Fast Match that runs without " 
" explicit length distributions. " 
It - " 

t. 

""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
"
"   SUBROUTINE EVALPP
"
"   This routine performs the actual fast match calculation for the
"   current lattice node. The main program only calls this routine to
"   evaluate valid nodes - not the null nodes that correspond to leaves.
"
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
"
"   Initializations... given the current lattice node number, look up
"   the corresponding clink number, set up match parameters such as
"   the length of the start time distribution, pointers to the start
"   time distribution in the boundary stacks, and the offset into the
"   feneme stream.
"
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
"
"   Initial zeroes = 4:
"
"   Pad the start time distribution with 4 zeroes, increase SDLEN by 4
"   to simplify looping after start time distribution ends.
"
"   Initialize output_distribution (time - 1), output_sum, set feneme
"   prob for first time slice equal to zero by clearing the multiplier.
"
"     output_distribution (0) = 0.0;
"     output_sum = 0.0;
"     feneme_prob = 0.0;
"     state_1 = 0.0;
"     state_2 = 0.0;
"     state_3 = 0.0;
"     state_4 = 0.0;
"
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""



ZERO4:  ADD# SDARY, SDLEN;              "point to last sample in the
        SETMA;                          "start time distribution
        DPX(LASTX) < ZERO;              "zero the last output sample
        DPY(OSY) < ZERO                 "zero the output sum

        INCMA;                          "last sample + 1
        MI < ZERO;                      "pad out with zero
        DPY(ST1Y) < ZERO;               "state_1 = 0.0
        INC SDLEN                       "sdlen = sdlen + 1

        INCMA;                          "last sample + 2
        MI < ZERO;                      "pad out with zero
        DPY(ST2Y) < ZERO;               "state_2 = 0.0
        INC SDLEN                       "sdlen = sdlen + 2

        INCMA;                          "last sample + 3
        MI < ZERO;                      "pad out with zero
        DPY(ST3Y) < ZERO;               "state_3 = 0.0
        INC SDLEN                       "sdlen = sdlen + 3

        INCMA;                          "last sample + 4
        MI < ZERO;                      "pad out with zero
        DPY(ST4Y) < ZERO;               "state_4 = 0.0
        INC SDLEN                       "sdlen = sdlen + 4

        CLR TIME;                       "output time counter = 0
        FMUL DPX(LASTX), DPY(OSY)       "clear the multiplier

        MOV SDLEN, LOPLIM;              "1st loop limit = sdlen
        FMUL                            "push the multiplier



""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
"
"   First loop: initial zeroes = 4
"
"   Calculate output distribution value for current time, update output
"   sum, calculate feneme probability for the next time slice.
"
"   do time = 1 to start_time_length + 4;
"     output_distribution (time) =
"       feneme_prob * (output_distribution (time - 1) + state_1);
"     output_sum = output_sum + output_distribution (time);
"     state_1 = state_2 * feneme_prob;
"     state_2 = state_3 * feneme_prob;
"     state_3 = state_4 * feneme_prob;
"     state_4 = st_array (time);
"     feneme_prob = fd_array (local_buffer (first_feneme + time))
"                   * tail_buffer (first_feneme + time);
"   end;
"
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""


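For readers who do not work with array-processor microcode, the first loop described in the comment block above can be sketched in Python. This is only an illustrative sketch, not part of the original listing: the function name `first_loop` and its argument layout are assumptions, the buffers are plain Python lists, and `st_array` is taken to hold the start time distribution without padding (the sketch appends the four zeroes itself).

```python
def first_loop(st_array, fd_array, local_buffer, tail_buffer,
               first_feneme, initial_zeroes=4):
    """Sketch of the first fast-match loop (names are illustrative).

    st_array      -- start time distribution, st_array[0] is time 1
    fd_array      -- feneme probability, indexed by feneme symbol
    local_buffer  -- feneme symbol stream
    tail_buffer   -- tail probability for each position in the stream
    """
    # Pad the start time distribution with zeroes, as ZERO4 does.
    padded = list(st_array) + [0.0] * initial_zeroes

    out = [0.0]            # output_distribution(0) = 0.0
    output_sum = 0.0
    feneme_prob = 0.0      # cleared multiplier for the first time slice
    s1 = s2 = s3 = s4 = 0.0

    for time in range(1, len(padded) + 1):
        out.append(feneme_prob * (out[time - 1] + s1))
        output_sum += out[time]
        # Shift the four internal states one step toward the output.
        s1 = s2 * feneme_prob
        s2 = s3 * feneme_prob
        s3 = s4 * feneme_prob
        s4 = padded[time - 1]          # st_array(time) in the listing
        # Probability applied at the next time slice.
        feneme_prob = (fd_array[local_buffer[first_feneme + time]]
                       * tail_buffer[first_feneme + time])

    return out, output_sum, feneme_prob
```

With a single unit of start probability and a constant feneme probability of 0.5, the mass reaches the output after four state shifts, having been multiplied by 0.5 four times.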

L41:    INC LFARY;                      "start transfer of next
        SETMA;                          "feneme symbol from stream
        FADD DPX(LASTX), DPY(ST1Y);     "add last output + state_1
        FMUL                            "push the multiplier

        INC TPARY;                      "start transfer of next
        SETMA;                          "tail probability
        FMUL FM, DPY(ST2Y);             "state_2 * feneme_prob
        FADD;                           "push the adder pipeline
        DPX(FPX) < FM                   "save feneme_prob

        INC SDARY;                      "transfer next starting pt
        SETMA;
        FMUL DPX(FPX), FA               "(last+st1) * feneme_prob

        LDSPI FDOFF;                    "get the current feneme
        DB = MD;                        "symbol from bus
        FMUL DPX(FPX), DPY(ST3Y)        "state_3 * feneme_prob

        ADD# FDOFF, FDARY;              "start transfer of next
        SETMA;                          "feneme probability
        DPX(TPX) < MD;                  "save next tail probability
        FMUL DPX(FPX), DPY(ST4Y);       "state_4 * feneme_prob
        DPY(ST1Y) < FM                  "state_1 = state_2 * fp

        FMUL;                           "push the multiplier
        DPX(LASTX) < FM;                "save output sample
        DPY(ST4Y) < MD                  "store next input sample

        INC TIME;                       "update the time counter
        FMUL;                           "push the multiplier
        DPY(ST2Y) < FM;                 "state_2 = state_3 * fp
        FADD DPX(LASTX), DPY(OSY)       "add output to output_sum

        DEC LOPLIM;                     "at end of loop?
        FMUL DPX(TPX), MD;              "feneme * tail prob
        DPY(ST3Y) < FM;                 "state_3 = state_4 * fp
        FADD                            "push the adder

        BGT L41;                        "keep looping if not done
        FMUL;                           "push the multiplier
        DPY(OSY) < FA;                  "save output_sum
        INCTMA;                         "update pointer to scratch
        DB = DPX(LASTX);                "area for output dist, save
        OUT                             "the output sample



""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
"
"   Second loop:
"
"   Time is now equal to start_time_length + initial_zeroes, so the
"   start time distribution has ended and all internal states are
"   equal to zero. Therefore, this section of code is common to all
"   cases of initial zeroes.
"
"   Loop until time_limit (start_time_length + ld_length - 1), or until
"   the output falls below loop_cutoff.
"
"   do time = start_time_length + 1 + initial_zeroes to time_limit
"      while (output_distribution (time) >= loop_cutoff);
"     output_distribution (time) =
"       feneme_prob * output_distribution (time - 1);
"     output_sum = output_sum + output_distribution (time);
"     feneme_prob = fd_array (local_buffer (first_feneme + time))
"                   * tail_buffer (first_feneme + time);
"   end;
"
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""


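The second loop described in the comment block above can be sketched in Python as follows. This is an illustrative sketch only: the name `second_loop` is an assumption, `next_probs[t]` stands in for the product fd_array(...) * tail_buffer(...) that the listing fetches after computing time t, and the sketch assumes that the sample which first falls below the cutoff is still added to the sum before the loop exits (the listing's pipelining makes the exact order hard to read off).

```python
def second_loop(out, output_sum, feneme_prob, next_probs,
                time_limit, loop_cutoff):
    """Sketch of the second fast-match loop (names are illustrative).

    out         -- output distribution so far; out[-1] is the last sample
    next_probs  -- next_probs[t] is the feneme * tail probability
                   to apply after time t (hypothetical precomputed list)
    """
    out = list(out)          # do not modify the caller's list
    time = len(out)          # first time index not yet computed
    while time <= time_limit:
        value = feneme_prob * out[time - 1]
        out.append(value)
        output_sum += value
        if value < loop_cutoff:
            break            # output fell below loop_cutoff
        feneme_prob = next_probs[time]
        time += 1
    return out, output_sum
```

Because all internal states are zero at this point, each new output is simply the previous output scaled by the current feneme probability, which is why the loop can stop early once the output is negligible.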

EVAL2:  MOV LDLEN, LOPLIM               "loop limit = ld_len - 1
        SUB INTZERO, LOPLIM             "           - initial_zeroes
        DEC LOPLIM
        BGT L42                         "if looplimit > 0, do it
        JMP EVOUT                       "otherwise, jump to exit

L42:    INC LFARY;                      "start transfer of next
        SETMA;                          "feneme symbol from stream
        FMUL                            "push the multiplier

        INC TPARY;                      "get next tail prob
        SETMA;                          "from buffer
        FMUL FM, DPX(LASTX)             "last = last * feneme_prob

        FMUL                            "push the multiplier

        LDSPI FDOFF;                    "get feneme symbol
        DB = MD;                        "from input stream
        FMUL                            "push the multiplier

        ADD# FDOFF, FDARY;              "use feneme as pointer into
        SETMA;                          "probability array
        DPX(TPX) < MD;                  "save the tail probability
        FSUBR FM, DPY(LCY)              "compare output to cutoff

        FADD FM, DPY(OSY);              "add output into output_sum
        DPX(LASTX) < FM                 "save output in register

        INC TIME;                       "increment time counter
        FADD                            "push the adder

        DEC LOPLIM;                     "see if we are done yet
        FMUL DPX(TPX), MD;              "feneme * tail probability
        BFGE EVOUT;                     "quit if output < cutoff
        DPY(OSY) < FA                   "save output sum

        BGT L42;                        "if not done, keep looping
        FMUL;                           "push the multiplier
        INCTMA;                         "update pointer to scratch
        DB = DPX(LASTX);                "area for output dist, save
        OUT                             "the output sample
The embodiments of the invention in which an exclusive
property or privilege is claimed are defined as follows:



1. In a speech recognition system, a method of measuring
the likelihood of a word corresponding to a spoken input
where the word is from a vocabulary of words, the method
comprising the steps of:

(a) generating a string of labels in response to a spoken
input, each label (i) being from an alphabet of labels
and (ii) representing a respective sound type;

(b) determining label votes, each label vote representing
the likelihood that a respective label is produced
when a given word is uttered; and

(c) for a subject word, accumulating a label vote for
each of at least some of the labels generated in the
string;

the accumulated label votes providing information
indicative of the likelihood of the subject word.



2. The method of Claim 1 comprising the further step of: 

(d) combining the accumulated label votes together to 
provide a likelihood score for the subject word. 

3. The method of Claim 2 comprising the further step of:

(e) repeating steps (c) and (d) for each word in the
vocabulary; and

(f) selecting the n words having the highest likelihood
scores as candidate words, where n is a predefined
integer.

4. The method of Claim 2 wherein the vote determining
step includes the step of (g) evaluating the log
probability of the subject word producing the respective
label; and

wherein the combining step includes the step of (h)
summing up the determined label votes for the subject
word; and

wherein the method comprises the further step of:

(j) determining a scaled likelihood score for the
subject word as the sum of votes divided by the number of
generated labels that are summed.

5. The method of Claim 4 comprising the further steps
of:

(k) repeating steps (c), (d), (g), (h), and (j) for each
word in the vocabulary; and

(l) selecting as candidate words the n words having the
highest scaled likelihood scores where n is a predefined
integer.
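As an informal illustration of the scoring recited in Claims 4 and 5 (log-probability votes summed over the generated labels, divided by the number of labels summed, then ranked), a minimal Python sketch follows. The function names, the dictionary layout of `log_votes`, and the example data are assumptions made for illustration and are not taken from the specification.

```python
import math


def scaled_scores(label_string, log_votes):
    """Return {word: sum of log votes / number of labels summed}.

    log_votes[word][label] is the log probability of `word`
    producing `label` (hypothetical table layout).
    """
    if not label_string:
        raise ValueError("label string must not be empty")
    scores = {}
    for word, votes in log_votes.items():
        total = sum(votes[label] for label in label_string)
        scores[word] = total / len(label_string)
    return scores


def top_n(scores, n):
    """Return the n words with the highest scaled likelihood scores."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


example_votes = {
    "the": {"A": math.log(0.5), "B": math.log(0.25)},
    "a": {"A": math.log(0.125), "B": math.log(0.125)},
}
example_scores = scaled_scores(["A", "A", "B"], example_votes)
candidates = top_n(example_scores, 1)
```

Dividing by the number of labels keeps scores comparable when strings of different lengths are evaluated.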

6. The method of Claim 1 comprising the further step of:

(m) determining, for a subject word, a penalty for each
label wherein the penalty for a given label represents
the likelihood of the subject word not producing the
given label; and

(n) combining together (i) the label votes and (ii) the
penalties corresponding to the subject word, wherein
each combined label vote corresponds to a label
generated in response to the spoken input and wherein each
combined penalty corresponds to a label not generated
in response to the spoken input.
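The combination of votes and penalties in Claim 6 can be pictured with the short Python sketch below. It is a hedged illustration only: the function name, the dictionary tables, and the choice to add one vote per occurrence of a label in the string (rather than once per distinct label) are assumptions, since the claim does not fix these details.

```python
def vote_penalty_score(label_string, votes, penalties, alphabet):
    """Combine votes for generated labels with penalties for absent ones.

    votes[label]     -- vote of `label` for the subject word
    penalties[label] -- penalty of `label` for the subject word
    alphabet         -- all labels the acoustic processor can produce
    """
    present = set(label_string)
    # One vote per generated label (assumption: repeats count again).
    score = sum(votes[label] for label in label_string)
    # One penalty per label of the alphabet that never occurred.
    score += sum(penalties[label] for label in alphabet
                 if label not in present)
    return score


example_score = vote_penalty_score(
    ["A", "B", "A"],
    votes={"A": -1.0, "B": -2.0, "C": -3.0},
    penalties={"A": -0.5, "B": -0.25, "C": -0.125},
    alphabet=["A", "B", "C"],
)
```

In this example the labels A, B, A contribute their votes and the missing label C contributes its penalty.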

7. The method of Claim 6 comprising the further steps 
of: 

(o) repeating steps (a), (b), (c), (m), and (n) for each
word as the subject word.

8. The method of Claim 7 comprising the further step of: 

(p) determining the label in the string that corresponds
to the most likely end time of a word-representing
utterance; and

(q) limiting the determining of votes and penalties to 
labels generated prior to the determined end time label. 

9. The method of Claim 7 comprising the further step of:

(r) setting a reference end time; and

(s) repeating the combining step (n) at successive
intervals following the set reference end time.

10. The method of Claim 1 wherein the vote determining
step (b) includes the steps of:

(t) forming each word as a sequence of models each model
being represented as a Markov model phone machine
characterized as having (i) a plurality of states, (ii) a
plurality of transitions extending from a state to a
state, (iii) first means for storing a likelihood count
for the occurrence of each transition, and (iv) second
means for storing a likelihood count for each of at least
some of the labels being produced at each of at least
some of the transitions;

(u) determining transition likelihood counts and label
likelihood counts from training data generated by the
utterance of known speech input; and

(v) producing an expected label distribution for each
word based on the determined transition likelihood
counts and label likelihood counts, each label vote
being derived from the expected label distribution.

11. In a speech recognition system having (i) an
acoustic processor which generates acoustic labels, (ii) a
vocabulary of words each of which is represented by a
word model comprising a sequence of Markov model phone
machines, and (iii) a set of trained statistics
indicating the label output probabilities and transition
probabilities of each phone machine in a word model, a
method of selecting likely candidate words from the
vocabulary comprising the steps of:

(a) for a subject word from the vocabulary, determining
a respective label vote for each label wherein a label
vote for a given label represents the likelihood of the
subject word producing the given label; and

(b) for a given string of labels generated by the
acoustic processor in response to an unknown spoken
input, combining the label votes for the subject word for
labels generated in the string.

12. The method of Claim 11 wherein the vote determining
step includes the steps of:

(c) computing the expected number and distribution of
labels for the subject word from the trained statistics;
and

(d) computing the logarithmic probability distribution
of labels for the subject word from the expected
distribution of labels;

the logarithmic probability of the subject word
producing a given label being the label vote of the given label
for the subject word.
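Step (d) of Claim 12 can be pictured with the following Python sketch, which turns expected label counts for one word into logarithmic votes. The function name and the `floor` parameter are assumptions introduced here; the floor only keeps the logarithm finite for labels with an expected count of zero and is not described in the claim.

```python
import math


def label_votes(expected_counts, floor=1e-6):
    """Convert expected label counts for a word into log votes.

    expected_counts[label] -- expected number of times the word's
                              model produces `label`
    floor                  -- smallest probability used before taking
                              the logarithm (illustrative assumption)
    """
    total = sum(expected_counts.values())
    if total <= 0.0:
        raise ValueError("expected counts must sum to a positive value")
    return {
        label: math.log(max(count / total, floor))
        for label, count in expected_counts.items()
    }


example_label_votes = label_votes({"A": 3.0, "B": 1.0, "C": 0.0})
```

Here label A, expected three times out of four, receives the vote log(0.75), while the never-expected label C receives the floored value.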
13. The method of Claim 12 comprising the further steps
of:

(e) determining the likelihood of each label not being
produced by the subject word as a label penalty for the
subject word;

(f) applying a length of the label string to the subject
word;

(g) for each label not occurring in the applied length,
extracting the label penalty for the subject word; and

(h) combining the votes for all labels occurring in the
applied length with the penalties for all labels not
occurring in the applied length to form a likelihood
score for the subject word.

14. The method of Claim 13 comprising the further step
of:

(j) dividing the likelihood score for the subject word
by the number of labels in the applied length to provide
a scaled likelihood score.

15. The method of Claim 14 comprising the further step
of:

(k) repeating steps (a) through (h) for each word in the
vocabulary as the subject word; and

(l) selecting as candidate words those words having the
n highest scaled likelihood scores.

16. The method of Claim 15 wherein the applying step
includes the step of:

(m) setting a reference end time;

(n) repeating the combining step (h) at successive
intervals following the set reference end time; and

(o) determining, for each word at each successive
interval, a scaled likelihood score relative to the scaled
likelihood scores of the other words at a given interval
and assigning to each subject word the highest scaled
relative likelihood score thereof.

17. The method of Claim 16 comprising the further step
of:

(p) performing an approximate acoustic match on all
words in a set of words from the vocabulary determined
from step (l).

18. The method of Claim 17 wherein each phone machine
has associated therewith (i) at least one transition and
(ii) actual label output probabilities, each actual
label probability representing the probability that a
specific label is generated at a given transition in the
phone machine, and wherein the acoustic match performing
step includes the steps of:

(aa) forming simplified phone machines which includes
the step of replacing by a single specific value the
actual label probabilities for a given label at all
transitions at which the given label may be generated
in a particular phone machine; and

(bb) determining the probability of a phone generating
the labels in the string based on the simplified phone
machine corresponding thereto.

19. A method as in Claim 18 wherein each sequence of
phones corresponding to a vocabulary word represents a
phonetic baseform, the method including the further step
of:

(cc) arranging the baseforms into a tree structure in
which baseforms share a common branch for as long as the
baseforms have at least similar phonetic beginnings,
each leaf of the tree structure corresponding to a
complete baseform.
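The tree arrangement of Claim 19 resembles a prefix tree over phone sequences. The Python sketch below is an illustration under stated assumptions: it shares branches only for identical phone prefixes (the claim allows "at least similar" beginnings), the key "#words" used to mark a complete baseform is hypothetical, and the example phone symbols are invented.

```python
def build_baseform_tree(baseforms):
    """Arrange baseforms into a nested-dict prefix tree.

    baseforms -- {word: [phone, phone, ...]}
    Each node maps a phone to a child node; the hypothetical key
    "#words" lists the words whose baseform ends at that node.
    """
    root = {}
    for word, phones in baseforms.items():
        node = root
        for phone in phones:
            # Reuse the branch if another baseform already created it.
            node = node.setdefault(phone, {})
        node.setdefault("#words", []).append(word)
    return root


example_tree = build_baseform_tree({
    "cat": ["K", "AE", "T"],
    "cab": ["K", "AE", "B"],
    "dog": ["D", "AO", "G"],
})
```

Because "cat" and "cab" share the branch K, AE, an acoustic match computed along that branch needs to be evaluated only once for both words.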

20. A method as in Claim 18 wherein the specific value
assigned to all label probabilities for a particular
label in a given phone machine is equal to at least the
maximum actual label probability over all transitions
for said particular label in the given phone machine.
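The replacement value of Claim 20 can be illustrated by taking, for each label, the maximum of its actual probabilities over all transitions of one phone machine. The Python sketch below uses exactly the maximum, which is the smallest value the claim permits; the function name and the nested-dictionary layout are assumptions for illustration.

```python
def simplify_label_probs(transition_label_probs):
    """Collapse per-transition label probabilities to one value per label.

    transition_label_probs -- {transition: {label: probability}}
    Returns {label: maximum probability over all transitions}.
    """
    simplified = {}
    for probs in transition_label_probs.values():
        for label, p in probs.items():
            simplified[label] = max(simplified.get(label, 0.0), p)
    return simplified


example_simplified = simplify_label_probs({
    "t1": {"A": 0.2, "B": 0.7},
    "t2": {"A": 0.5, "B": 0.1},
})
```

Since no actual probability exceeds the replacement value, a match computed with the simplified phone machine cannot underestimate the match computed with the detailed one.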
21. A method as in Claim 20 wherein each phone machine
also has associated therewith (c) a length distribution
for each phone which indicates probability as a function
of the number of labels produced by a particular phone,
the method including the further step of:

(dd) converting the length distribution for each phone
into a uniform probability length pseudo-distribution
which includes the step of ascribing a uniform value to
the probability of a given phone producing any number
of labels defined in the length distribution.

22. A method as in Claim 21 comprising the further steps 
of: 

(ee) finding the minimum length of labels having a non- 
zero probability; and 

(ff) confining the uniform pseudo-distribution to
lengths at least as long as the minimum length, lengths
less than the minimum length being assigned a
probability of zero.
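Steps (dd) through (ff) of Claims 21 and 22 can be sketched in Python as shown below. The claims do not state which uniform value is ascribed, so the default used here (the largest probability in the original length distribution) is purely an assumption, as are the function and parameter names.

```python
def uniform_length_pseudo_distribution(length_probs, uniform_value=None):
    """Replace a length distribution by a uniform pseudo-distribution.

    length_probs[k] -- probability that the phone produces k labels
    uniform_value   -- value ascribed to every allowed length; defaults
                       to max(length_probs) (illustrative assumption)
    """
    nonzero = [k for k, p in enumerate(length_probs) if p > 0.0]
    if not nonzero:
        raise ValueError("length distribution has no non-zero entry")
    min_length = nonzero[0]          # step (ee): minimum length
    if uniform_value is None:
        uniform_value = max(length_probs)
    # Step (ff): zero below the minimum length, uniform elsewhere.
    return [0.0 if k < min_length else uniform_value
            for k in range(len(length_probs))]


example_pseudo = uniform_length_pseudo_distribution(
    [0.0, 0.0, 0.1, 0.6, 0.3])
```

The result is called a pseudo-distribution because its entries need not sum to one.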

23. The method of Claim 12 wherein the combining step 
includes the steps of: 

(gg) summing the votes of labels in the string for the 
subject word. 

24. A speech recognition method of selecting likely
words from a vocabulary of words wherein each word is
represented by a sequence of at least one probabilistic
finite state phone machine and wherein an acoustic
processor generates acoustic labels in response to a
spoken input, the method comprising the steps of:

(a) forming a first table in which each label in the
alphabet provides a vote for each word in the
vocabulary, each label vote for a subject word indicating the
likelihood of the subject word producing the label
providing the vote.

25. The method of Claim 24 comprising the further
steps of:

(b) forming a second table in which each label is
assigned a penalty for each word in the vocabulary, the
penalty assigned to a given label for a given word being
indicative of the likelihood of the given label not
being produced according to the model for the given word.

26. The method of Claim 24 comprising the further step
of:

(c) for a given string of labels, determining the
likelihood of a particular word which includes the step of
combining the votes of all labels in the string for the
particular word.

27. The method of Claim 25 comprising the further step
of:

(d) for a given string of labels, determining the
likelihood of a particular word which includes the step of
combining the votes of all labels in the string for the
particular word together with the penalties of all
labels not in the string for the particular word.

28. The method of Claim 27 comprising the further step 
of: 

(e) repeating steps (a), (b), and (c) for all words as 
the particular word in order to provide a likelihood 
score for each word. 

29. A speech recognition apparatus for selecting likely 
words from a vocabulary of words wherein each word is 
represented by a sequence of at least one probabilistic 
finite state phone machine and wherein an acoustic 
processor generates acoustic labels in response to a 
spoken input, the apparatus comprising: 

(a) means for forming a first table in which each label
in the alphabet provides a vote for each word in the
vocabulary, each label vote for a subject word
indicating the likelihood of the subject word producing the
label providing the vote;

(b) means for forming a second table in which each label
is assigned a penalty for each word in the vocabulary,
the penalty assigned to a given label for a given word
being indicative of the likelihood of the given label
not being produced according to the model for the given
word; and

(c) means for determining, for a given string of labels,
the likelihood of a particular word which includes means
for combining the votes of all labels in the string for
the particular word together with the penalties of all
labels not in the string for the particular word.